Date post: 30-Mar-2018
Upload: phammien

Text and Multimedia Languages and Properties

Overview

• Text: main form of communicating knowledge

• Document: a single unit of information
• A document has
– syntax and structure: dictated by the application or by the person who created it
– semantics: specified by the author
– presentation style: specifies how it should be displayed or printed

Characteristics of a Document

[Diagram: a document combines syntax (text + structure + other media), semantics, and presentation style]

Overview

The syntax of a document can express
– structure
– presentation style
– semantics
– external actions

• This syntax can be
– implicit
– expressed in a simple declarative language
– expressed in a programming language

Metadata

Metadata, ‘data about the data’, is information on
– the organization of the data
– the various data domains
– the relationship between them

Metadata Examples

• Database management system:
– name of the relations
– fields or attributes of each relation
– domain of each attribute

• Text:
– author
– date of publication
– source of publication
– document length
– document genre

Descriptive Metadata

Descriptive Metadata: metadata that
– is external to the meaning of the document
– pertains to how the document was created

Example: the Dublin Core Metadata Element Set, which proposes 15 fields to describe a document

Semantic Metadata

Semantic Metadata: metadata that
– characterizes the subject matter found in the document’s contents
– is associated with a wide number of documents
– is increasing in its availability

Example:
– All books published in the USA are assigned Library of Congress subject codes

Semantic Metadata

Example:
– Many journals require author-assigned key terms (from a closed vocabulary of relevant terms)
– topical metadata in biomedical articles within the MEDLINE system are disease, anatomy, pharmaceuticals, etc.

Metadata for Web Document

• On the Web, metadata can be used for:
– cataloging: a popular format is BibTeX
– content rating
– intellectual property rights
– digital signatures
– privacy levels
– applications to electronic commerce

Metadata for Web Document

Resource Description Framework (RDF): new standard for Web metadata
– provides interoperability between applications
– allows the description of Web resources to facilitate automated processing of the information
– consists of a description of nodes and attached attribute/value pairs

Metadata for Web Document

– node: any Web resource, i.e. any Uniform Resource Identifier (URI)
– attribute: a property of a node
– value: a text string or another node
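
The node / attribute / value model can be sketched as a small data structure. This is only an illustration of the idea, not RDF syntax or any RDF library: the `Node` class, the `describe` helper, the example.org URIs and the property names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A Web resource identified by a URI, with attribute/value pairs."""
    uri: str
    properties: dict = field(default_factory=dict)


def describe(node, depth=0):
    """List a node's attribute/value pairs; a value may itself be a node."""
    lines = []
    indent = "  " * depth
    for attribute, value in node.properties.items():
        if isinstance(value, Node):
            lines.append(f"{indent}{attribute} -> <{value.uri}>")
            lines.extend(describe(value, depth + 1))
        else:
            lines.append(f"{indent}{attribute} = {value!r}")
    return lines


# Hypothetical resources: a page whose creator is another node.
author = Node("http://example.org/staff/42", {"name": "A. Author"})
page = Node("http://example.org/report.html",
            {"creator": author, "title": "Annual Report"})

for line in describe(page):
    print(line)
```

Because a value can be another node, descriptions form a graph rather than a flat record, which is what lets applications follow links between resources automatically.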

Text

• Text is coded in binary digits for the computer
• First coding schemes: EBCDIC, ASCII
– ASCII uses seven bits for each symbol (EBCDIC uses eight)
• Later, ASCII was extended to eight bits (ISO-Latin)
– accommodates several languages
– includes accents and diacritical marks
• Unicode (ISO 10646) uses a 16-bit code
– also accommodates Asian languages
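
A minimal sketch of the three code widths, using Python's built-in codecs. The helper `code_width` and the sample characters are chosen here for illustration; the codec names `ascii`, `latin-1` and `utf-16-be` are standard Python codec names.

```python
def code_width(symbol: str) -> int:
    """Number of bits needed to write the symbol's code value in binary."""
    return ord(symbol).bit_length()


# One sample symbol for each coding scheme mentioned above.
for symbol, codec in [("A", "ascii"), ("é", "latin-1"), ("語", "utf-16-be")]:
    encoded = symbol.encode(codec)
    print(symbol, codec, code_width(symbol), "bits,", len(encoded), "byte(s)")

# An accented letter has no 7-bit ASCII code, so encoding it as ASCII fails.
try:
    "é".encode("ascii")
except UnicodeEncodeError:
    print("'é' is outside 7-bit ASCII")
```

The accented letter needs the eighth bit, and the CJK character needs a 16-bit code unit.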

Formats

• In the past, IR systems converted each document to an internal format
disadvantages:
– the original application related to the document is no longer useful
– the contents of a document cannot be changed
• Current IR systems use filters
– might not be possible with proprietary or non-public formats

Formats

• Full ASCII syntax: TeX
• Binary syntax: Word, WordPerfect, FrameMaker
• Rich Text Format (RTF):
– used by word processors
– has ASCII syntax
– developed for document interchange

Formats

• Portable Document Format (PDF) and PostScript
– developed for displaying and printing documents
• Multipurpose Internet Mail Extensions (MIME)
– interchange formats
– used to encode electronic mail

Formats

• Compressed text:
– Compress (Unix)
– ARJ (PCs)
– ZIP (gzip-Unix, Winzip-Windows), etc.

• Conversion tools: convert binary files (compressed text) to ASCII text for transmission:
– uuencode/uudecode
– binhex
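
The compress-then-convert pipeline can be sketched with the Python standard library. Note the substitution: this uses `zlib` for compression and Base64 (the encoding commonly used by MIME) as the binary-to-ASCII step, not uuencode or binhex; the function names are hypothetical.

```python
import base64
import zlib


def pack_for_transmission(text: str) -> str:
    """Compress text to binary, then convert the binary to ASCII characters."""
    compressed = zlib.compress(text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack(ascii_payload: str) -> str:
    """Reverse both steps: ASCII back to binary, then decompress."""
    return zlib.decompress(base64.b64decode(ascii_payload)).decode("utf-8")


original = "to be or not to be, that is the question " * 20
payload = pack_for_transmission(original)
print(len(original), "characters ->", len(payload), "ASCII characters")
print(unpack(payload) == original)
```

The ASCII form is larger than the compressed binary (Base64 emits four characters for every three bytes), which is the price of passing through text-only channels.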

Information Theory

• the distribution of symbols is related to the information (or semantics) in written text
• entropy: used to capture information content (or information uncertainty)
– σ: the number of symbols in the alphabet
– p_i: probability of each symbol appearing (the symbol frequency over the total number of symbols)
– E: the entropy of this text

$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$

Entropy

• the σ symbols of the alphabet are coded in binary → the entropy is measured in bits
• example: for σ = 2,
– the entropy is 1 if both symbols appear the same number of times
– the entropy is 0 if only one symbol appears
• the text model determines the probabilities p_i and the amount of information in a text
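
A minimal sketch of the entropy computation, assuming the simplest text model: each p_i is the observed relative frequency of a symbol. The function name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Entropy in bits per symbol, with p_i taken as relative frequencies."""
    counts = Counter(text)
    total = len(text)
    # sum of p_i * log2(1 / p_i), which equals -sum of p_i * log2(p_i)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("abab"))  # two symbols, equally frequent
print(entropy("aaaa"))  # only one symbol appears
print(entropy("hello world"))
```

The first two calls reproduce the σ = 2 cases: one bit when both symbols are equally frequent, zero bits when only one symbol appears.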

Modeling Natural Language

• text is composed of symbols from a finite alphabet
• symbols can be divided into two subsets
– symbols that separate words
– symbols that belong to words
• A simple model to generate text is the binomial model
• In natural language, these symbols are not uniformly distributed → each symbol depends on the previous symbol

Modeling Natural Language

• a finite-context or Markovian model can be used to compute this dependency
• more complex models: finite-state models and grammar models

Distribution of the Frequencies

• Zipf’s Law is used to model the distribution of word frequencies in the text
– the frequency of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word
– in a text of n words with a vocabulary of V words, the i-th most frequent word appears $n/(i^{\theta} H_V(\theta))$ times
• $H_V(\theta)$ is the harmonic number of order θ of V

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$

– θ depends on the text, usually θ > 1 (1.5-2.0)
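
The expected counts under Zipf’s Law can be sketched directly from the two formulas. The function names and the sample values of n, V and θ are assumptions made for illustration.

```python
def harmonic(V: int, theta: float) -> float:
    """Harmonic number of order theta of V: sum of 1/j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_counts(n: int, V: int, theta: float) -> list:
    """Expected occurrences of the i-th most frequent word, for i = 1..V."""
    h = harmonic(V, theta)
    return [n / (i ** theta * h) for i in range(1, V + 1)]


# Hypothetical text: 100,000 words, 5,000 distinct words, theta = 1.5.
counts = zipf_counts(n=100_000, V=5_000, theta=1.5)
print([round(c) for c in counts[:5]])
print(round(sum(counts)))  # the expected counts add up to n
```

Dividing by the harmonic number is what normalizes the counts so that they sum to the text length n.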

Distribution of Words

A simple model: assume each word appears the same number of times in every document

A better model: a negative binomial distribution
– the fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

where p and α are parameters (they depend on the word and the document collection)
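
A minimal sketch of evaluating F(k). Since α need not be an integer, the binomial coefficient is computed through the gamma function, using the identity C(α+k−1, k) = Γ(α+k) / (Γ(k+1) Γ(α)). The function name and parameter values are hypothetical, and p is assumed positive.

```python
import math


def neg_binomial_fraction(k: int, p: float, alpha: float) -> float:
    """F(k): fraction of documents containing the word exactly k times."""
    # log of the generalized binomial coefficient C(alpha + k - 1, k)
    log_coeff = (math.lgamma(alpha + k)
                 - math.lgamma(k + 1)
                 - math.lgamma(alpha))
    return math.exp(log_coeff
                    + k * math.log(p)
                    - (alpha + k) * math.log1p(p))


# Hypothetical parameters for one word in one collection.
p, alpha = 0.5, 2.5
fractions = [neg_binomial_fraction(k, p, alpha) for k in range(200)]
print([round(f, 4) for f in fractions[:4]])
print(round(sum(fractions), 6))  # the fractions over all k sum to 1
```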

Document Vocabulary

• Heaps’ Law is used to predict the growth of the vocabulary size in natural language text

$$V = K n^{\beta} = O(n^{\beta})$$

• V: vocabulary size of a text of n words
• K, β: free parameters that depend on the text; 10 ≤ K ≤ 100; 0 ≤ β ≤ 1
• See Figure 6.2
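
A minimal sketch of both sides of Heaps’ Law: the predicted vocabulary size, and the vocabulary growth actually observed while reading a text. The function names and the sample K and β are assumptions for illustration.

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> float:
    """Predicted vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


def vocabulary_growth(words) -> list:
    """Observed vocabulary size after each word of the text is read."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


print(heaps_vocabulary(10_000, K=10, beta=0.5))
print(vocabulary_growth("a b a c b d".split()))
```

Fitting K and β to the observed growth curve of a real collection is how the free parameters would be estimated in practice.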

Figure 6.2

[Figure 6.2: word frequency F plotted against words; vocabulary size V plotted against text size]

Average Length of Words

• Heaps’ law:
– the length of the words in the vocabulary increases logarithmically with the text size
• In practice:
– the average length of the words is constant
• Finite-state model
– the space character has probability close to 0.2
– the space character can’t appear twice in a row
– there are 26 letters

Similarity Models

• similarity is measured by a distance function:
– Hamming distance
– edit or Levenshtein distance
– longest common subsequence (LCS)

• a distance function should
– be symmetric: the order of the arguments does not matter
– satisfy the triangle inequality
distance(a,c) ≤ distance(a,b)+distance(b,c)
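
The three measures can be sketched with standard dynamic programming. These are textbook formulations rather than anything prescribed by the slides, and the function names are chosen for illustration. Each function accepts any sequence, so the same code works on strings or on lists of lines.

```python
def hamming(a, b) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
print(lcs_length("ABCBDAB", "BDCABA"))
```

Note that the LCS length is a similarity score (larger means more alike), whereas the other two are distances.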

Similarity Models

• extending similarity to documents is done by
– considering lines as single symbols and computing the longest common subsequence of lines between two files (diff command in Unix)
problems:
– time consuming
– does not consider lines that are similar
• The second problem can be fixed by
– taking a weighted edit distance between lines
– computing the LCS over all the characters

Document Similarity

Other solutions include
• extract fingerprints of the documents and compare them, or find large repeated pieces
• use visual tools to see document similarity:
Dotplot draws a rectangular map where
– both coordinates are file lines
– the entry for each coordinate is a gray pixel that depends on the edit distance between the associated lines

Markup Languages

• Markup: extra textual syntax used to describe
– formatting actions
– structure information
– text semantics
– attributes, etc.
ex. the formatting commands of TeX
• formal markup languages are much more structured
• the marks are called tags (initial + text + ending)
• Sample markup languages: SGML, HTML, XML

SGML

• Standard Generalized Markup Language (ISO 8879): a metalanguage for tagging text
– developed by a group led by Goldfarb
– based on earlier work done at IBM
– provides the rules for defining a markup language based on tags
• an SGML document is defined by
– a description of the structure of the document
– the text marked with tags describing the structure

SGML

• each instance of SGML includes a description of the document structure called a document type definition
• the document type definition is used to
– describe and name the pieces that a document is composed of
– define how those pieces relate to each other
• part of the definition can be specified by an SGML document type declaration (DTD)

SGML

• SGML cannot formally express
– semantics of elements
– attributes
– application conventions
these can only be stated informally, as comments
• SGML tags are denoted by angle brackets <>
– <tagname> text </tagname>

TEI

• One important use of SGML is in TEI (Text Encoding Initiative), a cooperative project started in 1987
– to generate guidelines for the preparation and interchange of electronic texts for scholarly research and for industry
– one of the most used formats is TEI Lite

HTML

• HyperText Markup Language (HTML):
– is an instance of SGML
– created in 1992, the latest version is 4.0
– is being extended to solve its limitations
– HTML tags follow SGML conventions
– HTML tags include format directives
– other media can be embedded in HTML documents

HTML

• HyperText Markup Language (HTML):
– supports backward and forward compatibility
• Cascading Style Sheets (CSS)
– offer a powerful and manageable way to create visual effects in HTML pages

HTML 4.0

• specified in three variants: strict, transitional and frameset
• Strict: only worries about non-presentational markup, leaving all the display information to CSS
• Transitional: uses all the presentation features for pages
• Frameset: used when frames are used

HTML Limitations

• HTML does not
– allow users to specify their own tags or attributes
– support the specification of nested structures
– support the kind of language specification that allows consuming applications to check data for structural validity on importation

XML

• eXtensible Markup Language (XML)
– is a simplified subset of SGML
– is not a markup language
– is a metalanguage capable of containing markup languages
– allows a human-readable semantic markup (also machine-readable)
– makes it easier to develop and deploy new specific markup

XML

• eXtensible Markup Language (XML)
– enables automatic authoring, parsing, and processing of networked data
– does not have many of the restrictions imposed by HTML
– imposes a more rigid syntax on the markup
– distinguishes upper and lower case
– is easier to parse without knowledge of the tags (all attribute values must be between quotes)

XML

• eXtensible Markup Language (XML)
– allows users to define new tags, more complex structures
– has data validation capabilities
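
The stricter XML syntax can be observed with a parser that knows nothing about the tags involved. The sketch below uses `xml.etree.ElementTree` from the Python standard library; the `recipe` and `step` tags and the `is_well_formed` helper are made up for illustration. It checks well-formedness only; validation against a DTD is a separate step that this module does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """True if the markup parses as XML, with no knowledge of its tags."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# User-defined tags, nested structure, quoted attribute value: accepted.
print(is_well_formed('<recipe serves="4"><step>Mix</step></recipe>'))

# Tag names are case-sensitive, so <Step> is not closed by </step>.
print(is_well_formed('<recipe serves="4"><Step>Mix</step></recipe>'))

# Attribute values must be between quotes.
print(is_well_formed('<recipe serves=4><step>Mix</step></recipe>'))
```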

Recent Uses of XML

• Mathematical Markup Language (MathML): two sets of tags
– for presentation of formulas
– for the meaning of mathematical expressions

Recent Uses of XML

• Synchronized Multimedia Integration Language (SMIL):
- a declarative language for scheduling multimedia presentations on the Web
- the position and activation time of different objects can be specified
• Resource Description Framework (RDF): used as metadata information for XML

Multimedia

Multimedia: applications that handle different types of digital data originating from distinct types of media

Most common types of media are
- text, sound, images, video (animated sequence of images)
The differences among these media types
- volume, format, processing requirements

Image Formats

Several formats for images:
• direct representations of a bit-mapped display: XBM, BMP, PCX
- consume too much space
• compressed:
– Graphics Interchange Format (GIF)
– Joint Photographic Experts Group (JPEG)
• Tagged Image File Format (TIFF):
– exchanges documents between different applications and different computer platforms
– has fields for metadata and supports compression

Image Formats

Several formats for images:
• Truevision Targa image file (TGA):
– associated with video game boards
• Other formats:
– fax (bi-level image formats): JBIG
– fingerprints (highly accurate and compressed): WSQ
– satellite (large resolution and full-color images)
– Portable Network Graphics (PNG)

Audio Formats

Several formats for small pieces of digital audio:
– AU: created by Sun Microsystems, one of the most common formats on the Web
– MIDI: standard format to interchange music between electronic instruments and computers
– WAVE: the native sound format within the Windows environment, one of the most common on the Web

Formats for audio libraries
– RealAudio or CD formats

Animation Formats

For animations or moving images:
– Moving Pictures Expert Group (MPEG): related to JPEG
– AVI: includes compression (Cinepak)
– FLI: originally developed by Autodesk, Inc., plays back faster than MPEG for computer-generated animations at 640x480
– QuickTime: developed by Apple

Textual Images

Very important in office systems
– images of documents that contain mainly typed or typeset text
– obtained by scanning the documents
– usually for archiving purposes

• A large portion of a textual image is text
– can be used for retrieval purposes
– allows efficient compression

Textual Images

• further compression can be achieved by
– extracting the different text symbols or marks from the image
– building a library of symbols
– representing each one by a position in the library

Retrieval of Textual Images

• associate a set of keywords with each image, at creation time or when it is added to the database
• use OCR to extract the text of the image
• use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques

Graphics and Virtual Reality

For three-dimensional graphics
• Computer Graphics Metafile (CGM) standard (ISO 8632):
– defined for the open interchange of structured graphical objects and associated attributes
– specifies a two-dimensional data interchange standard

Graphics and Virtual Reality

– allows graphical data to be stored and exchanged between graphics devices, applications, and computer systems (device-independent)
– can represent vector graphics and raster format
– supports a collection of elements, called a metafile
– specifies which elements are allowed to occur in which positions in a metafile

Graphics and Virtual Reality

For three-dimensional graphics
• Virtual Reality Modeling Language (VRML, ISO/IEC 14772-1):
– file format for describing interactive 3D objects and worlds
– is a subset of the Silicon Graphics OpenInventor file format
– intended to be a universal interchange format for integrated 3D graphics and multimedia

HyTime

The Hypermedia/Time-based Structuring Language (HyTime) is a standard (ISO/IEC 10744)
– defined for multimedia documents markup
– is an SGML architecture that specifies the generic hypermedia structure of documents
– allows DTDs to be written for individual document models

HyTime

The hypermedia concepts directly represented by HyTime include
– complex locating of document objects
– relationships (hyperlinks) between document objects
– numeric, measured associations between document objects

HyTime

The HyTime architecture has three parts:
• The base linking and addressing architecture:
addresses the syntax and semantics of hyperlinks
• The scheduling architecture (derived from the base architecture):
defines the abstract representation of complex hypermedia structures (including music and interactive presentations)

HyTime

• The rendition architecture (an application of the scheduling architecture):
defines a general mechanism for defining the creation of new schedules from existing schedules (by applying special ‘rendition rules’ of different types)

Applications of HyTime

Standard Music Description Language (SMDL)
– an architecture for the representation of music information
– supporting multimedia time sequencing information

Metafile for Interactive Documents (MID)
– a common interchange structure
– based on SGML and HyTime
– takes data from various authoring systems and structures it for display on different presentation systems (with minimal human intervention)

Trends and Research Issues

• The main trend is the convergence and integration of the different efforts (the Web is the main application)
• ODA (Open Document Architecture):
– designed to share documents electronically without losing control over the content, structure, and layout of those documents
– defines a logical structure, a layout and the content
– an ODA file can be formatted, processable, or formatted processable

Trends and Research Issues

• Formatted files
– cannot be edited
– have information about content and layout

• Processable files
– can be edited
– have content and logical information

• Formatted processable files
– have everything

Trends and Research Issues

• Recent developments include:
– the document object model (DOM)
– integration between VRML and Dynamic HTML
– integration between the Standard Exchange for Product Data format (STEP, ISO 10303) and SGML
– effort to convert MARC to SGML by defining a DTD, as well as MARC to XML

Trends and Research Issues

• Recent developments include:
– CGM: developing a new encoding which can be parsed by XML
– Several new proposals such as
  • SDML (Signed Document Markup Language)
  • VML (Vector Markup Language)
  • PGML (Precision Graphics Markup Language)

Taxonomy of Web Languages

[Figure: taxonomy of Web languages. Metalanguages: SGML, XML. Languages: HTML, TEI Lite, HyTime, next-generation HTML, RDF, MathML, SMIL. Style sheets: DSSSL, XSL, CSS.]

