
© Maria Indrawan Monash University 2003

CSE3201/4500

Information Retrieval Systems

Maria Indrawan

C4.26, 9903-1916

maria.indrawan@infotech.monash.edu.au

Type of Data

[Figure: a spectrum from structured to non-structured data]

• relational database (structured)

• XML documents (semi-structured)

• free text, search engine (non-structured)

• data representation

• query formulation

• matching

Introduction

• What will I learn in this unit?

– How to manage data that cannot be effectively handled by a relational DBMS:

• XML documents

• Text (free text)

• There will be no SQL in this unit.

Objectives

• On completion of this unit, you will (hopefully!) be able to:

– Understand the different natures of information (structured, semi-structured, unstructured) and the issues each raises for information retrieval.

– Understand the XML technologies and their role in information retrieval.

– Create and manipulate XML documents.

– Understand the design issues and the various approaches to the development of text databases.

Prerequisite Knowledge

• Relational database concepts, such as SQL, indexing.

• Basic UNIX commands, e.g. file and directory manipulation commands.

• HTML.

• Basic level of Maths (year-12 level).

Assessment

• There are different assessments for CSE3201 and CSE4500.

• Undergraduate students => CSE3201

• Masters students => CSE4500

CSE4500 Assessment

• Component A:

– Assignment 1: XML Schema, 10% (week 6)

– Assignment 2: XSLT, 15% (week 9)

– Unit Test: XML, XSLT, 15% (week 10)

CSE4500 Assessment

• Component B:

– Assignment 3: Research Paper, 10% (week 12)

• Component C:

– Exam, 50%

CSE3201 Assessment

• Component A:

– Assignment 1: XML Schema, 10% (week 6)

– Assignment 2: XSLT, 15% (week 9)

– Unit Test: XML, XSLT, 15% (week 10)

CSE3201 Assessment

• Component B:

– Unit Test on text retrieval, 10% (week 12)

• Component C:

– Exam, 50%

Assessment Rules

• The result of the unit test determines the maximum final grade for Component A as follows:

Unit Test => Maximum grade for Component A

Fail => Pass

Pass => Credit

Credit => Distinction

Distinction => High Distinction

Assessment Rules

• In order to pass this unit you must attain:

– 50% overall, and

– at least 40% of the available marks in each of components A, B and C.

Textbook

Prescribed:

XML: How To Program (1st ed.), Deitel, H.M., Deitel, P.J., Nieto, T.R., Lin, T. and Sadhu, P., Prentice Hall.

Recommended:

Professional XML, 2nd ed., Wrox.

Beginning XML, Wrox.

XML Schema, Eric van der Vlist, O'Reilly.

Plagiarism/Cheating

• Please read all the necessary university materials on cheating/plagiarism (listed in the unit guide).

Computing Facilities

• Quota system

• Acceptable use policy

– http://www.infotech.monash.edu.au/myfit/students/student_labinfo_rules_netusage.cfm

– http://www.adm.monash.edu.au/unisec/pol/itec12.html

Being Resourceful and Independent

• I have a question on …

– Read the textbook or reading list.

– Explore additional materials, eg W3C.

– Ask my tutor.

– Ask my lecturer.

• Can I ask my tutor/helpdesk to find the bugs in my work?

– No.

• Will the solutions to the tutorial exercises be published?

– No. Students are encouraged to discuss their work with the tutors.

• Will studying the lecture notes be sufficient for this unit?

– No. Students need to read the textbook and the additional reading list.

Basic XML

Objectives

• Be able to:

– Understand XML technologies and their roles.

– Understand the different components of an XML document.

– Create a well-formed XML document.

What is XML?

• XML = Extensible Markup Language.

• Markup languages:

– HTML

– SGML

• Utilise the markup to define the

– structure

– semantics => to a certain level.

• World Wide Web Consortium (W3C) recommendation

– www.w3c.org

XML vs HTML

HTML

• Tags define the presentation layout.

<p> CSE3201 </p>
<p> Information Retrieval </p>

XML

• Tags define the structure and the meaning of the data.

<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>

Why XML?

• Distributed applications need to share data.

– plain text

– structure and the meaning of the data are tightly defined.

• Delivery of data to multiple devices

– Separation of data and presentation.

XML Document – an Example

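<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
  …
</bookshop>

[Figure: tree view of the document. bookshop is the root; it has two book children; a book contains title, author (with initials and surname) and price; price carries the value attribute.]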
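The tree view above can be reproduced with a short script. This is only a sketch: it uses Python's standard `xml.etree.ElementTree` module rather than any tool named in the unit, and the helper name `outline` and the `@` prefix for attributes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The bookshop document from the slide (the elided "…" part is left out).
BOOKSHOP_XML = """<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
</bookshop>"""


def outline(element, depth=0):
    """Return one indented line per element; attributes are prefixed with '@'."""
    lines = ["  " * depth + element.tag]
    for name in element.attrib:
        lines.append("  " * (depth + 1) + "@" + name)
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(BOOKSHOP_XML))))
```

Each level of indentation corresponds to one level of nesting in the document, which is the same container structure the figure shows.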

XML Technologies

• DTD/Schema

– definition of XML structures

• XSL (XSLT and XSL-FO)

– presentation

• XPath

– locating nodes

• XLink, XPointer

– linking

• DOM and SAX

– APIs to manipulate XML
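To give a feel for "locating nodes", here is a small sketch using the limited XPath subset supported by Python's `xml.etree.ElementTree`. The sample document, the titles "Book A" and "Book B", and the function names are hypothetical; a full XPath engine supports far more than is shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the unit materials.
SHOP = ET.fromstring("""<bookshop>
  <book><title>Book A</title><price value="10.00"/></book>
  <book><title>Book B</title><price value="25.50"/></book>
</bookshop>""")


def titles(shop):
    """Path './book/title': every title that is a child of a book child."""
    return [t.text for t in shop.findall("./book/title")]


def price_values(shop):
    """Path './/price[@value]': every price, at any depth, with a value attribute."""
    return [p.get("value") for p in shop.findall(".//price[@value]")]


if __name__ == "__main__":
    print(titles(SHOP))
    print(price_values(SHOP))
```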

XML Parser

• Required to read and manipulate XML documents.

• Reads an XML document as plain text and transforms it into a data structure, typically a tree, in memory.

• Applications, such as a web browser, then access the data structure and process the data according to their own objectives.

• Example: msxml
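The slide names msxml; as a stand-in, the sketch below uses the DOM parser that ships with Python (`xml.dom.minidom`) to show the same two steps: the parser turns plain text into a tree, then the application reads from that tree. The function name `parse_to_tree` and the sample string are assumptions for illustration.

```python
from xml.dom.minidom import parseString


def parse_to_tree(xml_text):
    """Parse plain text into an in-memory DOM tree and return its root element."""
    document = parseString(xml_text)
    return document.documentElement


# Step 1: the parser builds the tree.
root = parse_to_tree("<bookshop><book><title>XML</title></book></bookshop>")

# Step 2: the application walks the tree for its own purpose.
first_title = root.getElementsByTagName("title")[0]
title_text = first_title.firstChild.data

if __name__ == "__main__":
    print(root.tagName, title_text)
```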

XML Usage

• SOAP (Simple Object Access Protocol)

• Microsoft BizTalk Server

• WSDL and UDDI in Web Services

• Semantic Web

XML Issues

• Performance

– text processing vs binary processing

• Security

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Elements

[Figure: the bookshop tree, annotated by element role]

• Root element (compulsory): bookshop

• Branch elements: book, author

• Leaf elements: title, initials, surname, price

• value is an attribute of price, not an element.

Element

• The basic building block of XML markup.

• It may contain:

– Text

– Other elements (child elements)

– Attributes

– Character Data

– Other markup, e.g. comments

• Delimited with a start-tag and an end-tag.

• An element can be empty.

• The end-tag CANNOT be omitted, unlike in HTML.

• Each tag must contain a valid element type name.
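The points above can be tried out by building an element in code instead of typing the tags by hand. This is a minimal sketch using Python's `xml.etree.ElementTree`; the function name `build_book` is hypothetical. The library writes the start-tag and end-tag itself, so neither can be forgotten.

```python
import xml.etree.ElementTree as ET


def build_book(title, price):
    """Build a <book> element with a text child and an empty child."""
    book = ET.Element("book")
    title_el = ET.SubElement(book, "title")      # child element
    title_el.text = title                        # text content
    ET.SubElement(book, "price", {"value": price})  # empty element with an attribute
    return book


# Serialise to text; the exact whitespace of the output is up to the library.
xml_text = ET.tostring(build_book("Harry Potter", "$16.95"), encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```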

Element’s Name

• Element’s Name (Tag’s name) is CASE SENSITIVE.

– <BOOK>, <Book> and <book> are three different element names.

• Trailing space inside a tag is legal but will be ignored.

– <BOOK > = <BOOK>

Empty Element

• Has no content.

• May be associated with attribute.

• Example: <img src='logo.png'></img>

can be abbreviated to

<img src='logo.png'/>
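A quick way to convince yourself that the two spellings mean the same thing is to parse both and compare the results. This sketch uses Python's `xml.etree.ElementTree`; the helper name `describe` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img src='logo.png'></img>")
short_form = ET.fromstring("<img src='logo.png'/>")


def describe(element):
    """Summarise an element as (name, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))


if __name__ == "__main__":
    print(describe(long_form) == describe(short_form))
```

Both forms give an element with no text and no children; only the attribute remains.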

Attributes

• Information regarding the element.

"If elements are the 'nouns' of XML, then attributes are its 'adjectives'."

• <tagname attribute_name="attribute_value">

<book>
  <title> Harry Potter</title>
</book>

<book title="Harry Potter">
</book>

Attributes vs Element

• The choice is determined by the semantics of the content.

• Attributes are characteristics of an element.

<book>
  <title> Harry Potter</title>
</book>

<book title="Harry Potter">
</book>
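From the application's side the two designs are read differently: a child element is looked up by tag name, an attribute by attribute name. The sketch below, using Python's `xml.etree.ElementTree`, reads the title from either form; the function name `title_of` is hypothetical.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title> Harry Potter</title></book>")
as_attribute = ET.fromstring('<book title="Harry Potter"></book>')


def title_of(book):
    """Return the title whether it is stored as an attribute or as a child element."""
    if "title" in book.attrib:
        return book.get("title")
    return book.findtext("title", default="").strip()


if __name__ == "__main__":
    print(title_of(as_element), "|", title_of(as_attribute))
```

Note that the element form can later grow children of its own (for example a subtitle), while an attribute value is always a single string.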

Character References

• Used to enter characters that are not supported by the input device (keyboard).

– e.g. entering £ using a US-ASCII keyboard.

• Format: &#NNNNN; or &#xXXXX;

– N decimal

– X hexadecimal

• Example: $ => &#36; or &#x24;
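The decimal and hexadecimal forms can be checked by letting a parser resolve them. A small sketch with Python's `xml.etree.ElementTree` follows; the helper names `decode` and `char_refs` and the wrapper element `t` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def decode(fragment):
    """Wrap a fragment in a dummy element and return the text the parser sees."""
    return ET.fromstring("<t>" + fragment + "</t>").text


def char_refs(character):
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(character)
    return "&#%d;" % code_point, "&#x%X;" % code_point


if __name__ == "__main__":
    print(decode("&#36;"), decode("&#x24;"), char_refs("£"))
```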

Entity References

• Entities may be defined and used for:

– Representing characters used in markup

• &lt; == "<"

• &amp; == "&"

– Strings

• &IR; == Information Retrieval

• Predefined entities: &lt;, &gt;, &amp;, &quot; and &apos;
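A user-defined entity such as &IR; has to be declared before it can be used. The sketch below declares it in an internal DTD subset and lets Python's `xml.etree.ElementTree` expand it together with the predefined entities; the document content and the function name `unit_text` are assumptions for illustration. `xml.sax.saxutils.escape` does the reverse job for text that is about to be written into a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# &IR; is declared in the internal subset; &lt; &gt; &amp; are predefined.
DOC = """<!DOCTYPE unit [
  <!ENTITY IR "Information Retrieval">
]>
<unit>&IR; &lt;CSE3201&gt; &amp; CSE4500</unit>"""


def unit_text():
    """Return the element text after the parser has expanded every entity."""
    return ET.fromstring(DOC).text


if __name__ == "__main__":
    print(unit_text())
    print(escape("a < b & c"))
```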

Character Data

• Used to escape blocks of text containing characters which would otherwise be recognized as markup.

• <![CDATA[…]]>

• <![CDATA[<greeting>Hello, world!</greeting>]]>

Character Data(2)

With a CDATA section:

<example>
<![CDATA[&Warn;-&Disclaimer;&lt;&copy 2001; &PM;&gt;]]>
</example>

The equivalent without CDATA, where every & must be written as &amp;:

<example>
&amp;Warn;-&amp;Disclaimer;&amp;lt;&amp;copy 2001; &amp;PM;&amp;gt;
</example>
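That the two spellings carry the same character data can be checked by parsing both. This is a sketch with Python's `xml.etree.ElementTree`; the constant names are hypothetical, and the surrounding line breaks of the slide version are left out so that the two strings compare exactly.

```python
import xml.etree.ElementTree as ET

CDATA_FORM = (
    "<example><![CDATA[&Warn;-&Disclaimer;&lt;&copy 2001; &PM;&gt;]]></example>"
)
ESCAPED_FORM = (
    "<example>&amp;Warn;-&amp;Disclaimer;&amp;lt;&amp;copy 2001; "
    "&amp;PM;&amp;gt;</example>"
)

# Inside CDATA nothing is expanded; in the escaped form only &amp; is expanded.
cdata_text = ET.fromstring(CDATA_FORM).text
escaped_text = ET.fromstring(ESCAPED_FORM).text

if __name__ == "__main__":
    print(cdata_text == escaped_text)
```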

Processing Instruction(PI)

• Processing instructions (PIs) allow documents to contain instructions for applications.

• <?target … instruction … ?>

• Target is used to identify the application or other object to which the PI is directed.

• <?xml-stylesheet href="mystyle.css" type="text/css"?>
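An application that cares about a PI has to pick it out of the parsed document by its target. The sketch below does this with Python's `xml.dom.minidom`; the root element `doc` and the function name `processing_instructions` are assumptions for illustration.

```python
from xml.dom.minidom import parseString

DOC = '<?xml-stylesheet href="mystyle.css" type="text/css"?><doc/>'


def processing_instructions(xml_text):
    """Return (target, instruction) pairs for the PIs at the top of the document."""
    document = parseString(xml_text)
    return [
        (node.target, node.data)
        for node in document.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    ]


if __name__ == "__main__":
    print(processing_instructions(DOC))
```

The instruction part comes back as one plain string; it is up to the target application to interpret it.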

Comments

• Syntax: <!-- comment text -->

• Comments cannot be used within element tags.

<tag>… some content … <tag <!-- it is illegal -->>

• Comments may never be nested.

<!-- Comments cannot <!-- be nested --> like this -->
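Both rules can be tested by handing the text to a parser and seeing whether it is accepted. A minimal sketch with Python's `xml.etree.ElementTree` follows; the helper name `accepts` and the sample strings are assumptions based on the examples above.

```python
import xml.etree.ElementTree as ET


def accepts(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


LEGAL = "<tag><!-- comment text -->some content</tag>"
INSIDE_TAG = "<tag <!-- it is illegal -->>some content</tag>"
NESTED = "<tag><!-- Comments cannot <!-- be nested --> like this --></tag>"

if __name__ == "__main__":
    for sample in (LEGAL, INSIDE_TAG, NESTED):
        print(accepts(sample), sample)
```

The nested case fails because "--" may not appear inside a comment, so the second "<!--" is already an error.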

Structure of XML Document

• An XML document has to be well-formed.

– Conform to syntax requirements

– Conform to a simple container structure

• Common structure of an XML document:

– Prolog

– Body

– Epilog

Prolog

• Includes:

– XML Declaration

<?xml version="1.0" encoding='utf-8' standalone="yes"?>

• Version is mandatory; encoding and standalone are optional.

– Document Type Declaration

<!DOCTYPE

• It is not the DTD (Document Type Definition).

• A simple well-formed XML document does not need it.

– Schema declaration

Body & Epilog

• Body

– Contains 1 or more elements

– The "contents"

• Epilog

– Hardly used

– Can be used to identify end of document

Well-formed XML Document

• Contains a root element.

• Valid tag names.

• No overlapping tags.
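The three conditions can be checked mechanically, because any conforming parser must reject a document that is not well-formed. The sketch below uses Python's `xml.etree.ElementTree`; the function name `is_well_formed` and the test strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Each sample maps to the verdict a conforming parser should give.
CASES = {
    "<bookshop><book/></bookshop>": True,  # one root, properly nested
    "<book/><book/>": False,               # two top-level elements, no single root
    "<1book/>": False,                     # a name cannot start with a digit
    "<b><i>text</b></i>": False,           # overlapping tags
    "<BOOK></book>": False,                # names are case sensitive
}

if __name__ == "__main__":
    for sample, expected in CASES.items():
        print(is_well_formed(sample) == expected, sample)
```

Being well-formed says nothing about whether the document follows a particular DTD or schema; that is a separate check.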