
1/25 Malcolm Clark Supervisors: Professor Patrik O'Brian Holt Dr Ian Ruthven Genre Analysis of...

Date post: 17-Dec-2015
1/25 Malcolm Clark Supervisors: Professor Patrik O'Brian Holt Dr Ian Ruthven Genre Analysis of Structured E-mails for Corpus Profiling Workshop on Corpus Profiling for NLP/IR
Malcolm Clark

Supervisors: Professor Patrik O'Brian Holt, Dr Ian Ruthven

Genre Analysis of Structured E-mails for Corpus Profiling

Workshop on Corpus Profiling for NLP/IR

Presentation Outline

• Introduction
• The Problems
• Information Retrieval (IR), Genre and Perception
• Experiment – Research Questions, Setup, How do People use Textual Features?
• Conclusions
• Contributions and Implications
• Future Work

Introduction

• Focuses on IR and cognitive psychology.

• Corpora contain ‘exemplar’ documents, called genres, that are useful for profiling corpora.

• E-mail exchanges have socially constructed communicative behaviours, which exist to improve the efficiency of a community of practice and are useful for profiling corpora.

• Investigate these types of genres and how people use e-mails, in terms of genre and perception, for filtering.

The Problems

• Identifying genres for profiling a corpus
• Filter correct types of documents to the user by genre:
  • E-mail filtering
  • Understanding user tasks
• Rapidly understand a text without the necessity of parsing the whole document?

The Project Examines:

• The value of structure.

• How form or layout is perceived in structured texts.

• Whether constructivist (recognition) or ecological (action afforded) approaches apply, or whether both are used.

• If and how the objects of a community of practice (COP) can be comprehended and exploited.

• How readers react to genre features in document collections.

Information Retrieval

Division of IR into computer science lab experiments vs ‘user-orientated’ social studies

Järvelin (2006)

Genre – Background

[Diagram: a typical genre, after Orlikowski and Yates 1994]

• Purpose (communicative purpose): topics, themes, arguments, discourse structure
• Form (readily observable features): structural features, communications medium, language or symbol system (formality, specialised vocab)

Corpus – Genre Example from E-mail: Call for Papers

[Annotated e-mail image; labelled regions:]

• Header: title etc.
• Titles
• Abstract
• Topics (list)
• Dates and submission

Genre – What are Communities of Practice (COP)?

• What? Social institutions/sites.
• When? Human ‘agents’ draw on genre rules to engage in organizational communication.
• How? Produced, reproduced, or modified.

But how are they perceived and used?

Human Perceptual Systems

Two prominent fields in perception research:

Final goal? Recognition / Action

Experiment Pilot - Research Questions:

• How do human beings use genre features, and what do they perceive?

• How can genre categorization be performed by using current skimming methods?

• How do genres evolve in communities of practice (e.g. e-mail)?

• How are the document genres and structural attributes used?

Experiment Pilot - How do People Use Texts?

By eye tracking, i.e. recording the position and movement of the eye:

• Collect and analyse the empirical data produced by experiments in an e-mail community of practice.

• Locate the strategies and features for profiling corpora, e.g. centred blocks of text, invariant cues. Taking into account: features, strategies etc.

• How do humans view genre?

Experiment Pilot

Pilot - Setup

• Method: 4 x 16 image blocks (4 genres in each two blocks).
• Measurements
  • Number of genres identified correctly (purpose)
  • Structured vs non-structured form (form)
  • Genre identification response time (form)
  • Strategies and distinguishing features (purpose and form)
• Variables
  • Purpose/type of genre
  • Form in 4 representations

CFP – Content AND Structure

CFP – Structure and No Content

CFP – Content No Structure

CFP – No Content AND No Structure

Setup

• Task and procedure
  • Shown 64 images
  • Vocally identify each image.
  • Eye tracker records features and strategies used.
• Data recorded
  • X/Y locations of saccades and fixations.
  • Features and strategies
  • Desktop video recording – Wink
  • Timed and vocal responses
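The X/Y gaze samples listed above have to be grouped into fixations before features and strategies can be read from them. The slides do not say how this was done. One common option is a dispersion-threshold method (often called I-DT), sketched here in Python. The sample format, the function names and the thresholds (30 px, 100 ms) are illustrative assumptions, not values from the study.

```python
from typing import List, Tuple

Sample = Tuple[int, float, float]          # (time in ms, x in px, y in px)
Fixation = Tuple[int, int, float, float]   # (start ms, end ms, centre x, centre y)


def dispersion(window: List[Sample]) -> float:
    """Spread of a window of samples: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples: List[Sample],
                     max_dispersion: float = 30.0,   # assumed threshold, px
                     min_duration_ms: int = 100      # assumed threshold, ms
                     ) -> List[Fixation]:
    """Group consecutive gaze samples into fixations (dispersion-threshold sketch)."""
    fixations: List[Fixation] = []
    start, n = 0, len(samples)
    while start < n:
        # Grow the window until it spans at least the minimum duration.
        end = start
        while end < n and samples[end][0] - samples[start][0] < min_duration_ms:
            end += 1
        if end >= n:
            break  # not enough samples left to form a fixation
        if dispersion(samples[start:end + 1]) <= max_dispersion:
            # Extend the window while the gaze stays within the threshold.
            while end + 1 < n and dispersion(samples[start:end + 2]) <= max_dispersion:
                end += 1
            window = samples[start:end + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            start = end + 1
        else:
            start += 1  # sample belongs to a saccade; slide the window on
    return fixations


# Made-up gaze trace: two steady gazes separated by one jump (a saccade).
gaze = [(i * 20, 100.0, 100.0) for i in range(10)] + \
       [(200 + i * 20, 400.0, 300.0) for i in range(10)]
fixations = detect_fixations(gaze)
```

Samples that fall outside every fixation window are treated as saccades, so the movement between the two gaze positions in the made-up trace is not reported as a fixation.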

Results after 5 Participants

• Number of genres identified correctly (purpose): 11.5 per block out of 16.
• Unstructured vs structured: 41.6% / 72.9%
• Original (87.5%), original no content (77%), content no structure (68%), no content or structure (27%)
• Structured vs non-structured form, average response time (sec): 2.22 vs 2.72

HOW WAS IT DONE?

• Clues to strategies:
  • Skimmed shape: left (sem) / centred (cfp) aligned, and blocks of text/numerics
  • No structure / no structure or content: wide spirals of scanning behaviour, possibly looking for keywords?
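To put the reported figures on a common scale, the short Python sketch below converts them into an overall accuracy and two differences. Only the numbers come from the slide; the variable names are illustrative.

```python
# Figures as reported on the results slide (5 participants).
correct_per_block = 11.5
images_per_block = 16
accuracy_by_form = {"unstructured": 41.6, "structured": 72.9}   # percent correct
mean_response_time = {"structured": 2.22, "unstructured": 2.72}  # seconds

# Share of images in a block whose genre was identified correctly.
overall_accuracy = 100 * correct_per_block / images_per_block

# Difference between structured and unstructured representations.
structure_gain = accuracy_by_form["structured"] - accuracy_by_form["unstructured"]
time_saved = mean_response_time["unstructured"] - mean_response_time["structured"]

print(f"overall accuracy: {overall_accuracy:.1f}%")                # 71.9%
print(f"accuracy gain with structure: {structure_gain:.1f} points")  # 31.3 points
print(f"time saved with structure: {time_saved:.2f} s")             # 0.50 s
```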

Results – Distinguishing features

Genre    Features
CFP      Dates, centred blocks
Cinema   Block numerical content
ITS      Inconclusive (participants ignore them?)
Lib      List book(s) info at bottom
Nl       Paragraph/summary of item then URL
Ord      Left alignment/currency
Sem      Inconclusive
Spam     Keywords LOTTO/address and uppercase emboldened text

Conclusions

• Genre largely overlooked, but momentum is building.

• Our approach is useful for filtering e-mails and identifying features for characterising datasets.

• Purpose and form are very useful for using texts.
• Clues to perception processes found, but familiarity needs to be added to the mix.
• Train a machine to emulate human behaviour and understand textual input without reading the whole text?

Contributions and Implications

• Development of a language/perception theory/framework of:
  • How people use different types of texts.
  • Modelling user tasks and behaviour in relation to genre and perception.
• Extend laboratory IR/user-orientated IR approach
  • From: algorithms and machines.
  • To: a user-oriented and contextual level.

Future Work

• Focus on narrowing down my work domains.
• Investigate domains:
  • Academic document collections: CSIRO Enterprise
  • Legal documents – Enron
  • Weblogs – TREC Blog
  • Web domains – Wikipedia
• Consider multi-genres, e.g. course books, and large documents, e.g. social work reports


Motivation

• Useful features for profiling corpora.
• Adds another type of filtering to large data collections to take advantage of genre, e.g. news, biographical etc.

• Genre benefits organisations financially and administratively, e.g. rapid retrieval of information.

• Embrace genre and perception to understand and examine these structures!

Evaluation System

• Model the findings based on FERRET and McFRUMP’s Predictor and Substantiator.

• Our system: Genre Retrieval and Understanding Memory Program or GRUMP.

• Similar features to Clark and Watt (2007)?

Skimming & Categorisation

Skimming
• Used to identify the main points in a text much more quickly than normal reading, without having to understand every word.
• Normally used when a reader has a large amount of text to read within a limited time.

Categorisation
• Automatically labelled or classified.
• No need for manual organisation, labelling or sorting.

Evaluation System – How it Works

[Figure taken from Mauldin 1991. Texts → McFRUMP Parser → Abstracts; Queries → Query Parser → Case frame patterns; Abstracts and Case frame patterns → Case Frame Matcher → Relevant Texts]

McFRUMP parser contains the Predictor/Substantiator, Scripts etc.
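The matching step in the figure can be illustrated with a small Python sketch. It assumes, purely for illustration, that an abstract and a query pattern are both flat slot/filler dictionaries and that a pattern matches when all of its slots agree with the abstract. The names `frame_matches` and `relevant_texts`, the `"*"` wildcard and the example frames are hypothetical and are not taken from FERRET or McFRUMP.

```python
from typing import Dict, List

CaseFrame = Dict[str, str]  # slot name -> filler (assumed flat representation)


def frame_matches(pattern: CaseFrame, abstract: CaseFrame) -> bool:
    """True if every slot in the pattern has the same filler in the abstract.

    A filler of "*" in the pattern accepts any value (assumed convention).
    """
    return all(
        slot in abstract and (filler == "*" or abstract[slot] == filler)
        for slot, filler in pattern.items()
    )


def relevant_texts(pattern: CaseFrame,
                   abstracts: Dict[str, List[CaseFrame]]) -> List[str]:
    """Return the ids of texts with at least one abstract frame matching the pattern."""
    return [
        text_id
        for text_id, frames in abstracts.items()
        if any(frame_matches(pattern, frame) for frame in frames)
    ]


# Hypothetical abstracts, as a parser might produce them from two texts.
abstracts = {
    "story-1": [{"script": "ARREST", "suspect": "John Doe"}],
    "email-7": [{"script": "CFP", "event": "workshop"}],
}

# Hypothetical query pattern: any text about an arrest with a named suspect.
query = {"script": "ARREST", "suspect": "*"}
hits = relevant_texts(query, abstracts)  # ["story-1"]
```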

Evaluation System – Script Example

Using Schank’s (1981, ch 3) Conceptual Dependency theory of Scripts, Plans and Goals and DeJong’s (1982) FRUMP, make different genre scripts:

John Doe was arrested last Saturday morning after holding up the New Haven Savings Bank

$ARREST SCRIPT

• Police arrive at suspect location

• Suspect apprehended

• Taken to police station

• Charged

• Incarcerated or bailed

Using this type of script format to understand stories, genre rules/features can be specified in scripts to understand texts.

Modify script with genre rules
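As a rough illustration of how a script could be modified with genre rules, the Python sketch below stores a script as an ordered list of steps, each with trigger words, and reports which steps a text touches. The cue words, the `$CFP` genre script and the class and method names are hypothetical. The sketch uses plain substring matching and does not reproduce how FRUMP or McFRUMP work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str
    cues: List[str] = field(default_factory=list)  # hypothetical trigger words


@dataclass
class Script:
    name: str
    steps: List[Step]

    def matched_steps(self, text: str) -> List[str]:
        """Descriptions of the steps whose cue words occur in the text."""
        lowered = text.lower()
        return [
            step.description
            for step in self.steps
            if any(cue in lowered for cue in step.cues)
        ]


# The arrest script from the slide; the cue words are assumed.
ARREST = Script("$ARREST", [
    Step("Police arrive at suspect location", ["police arrive"]),
    Step("Suspect apprehended", ["arrested", "apprehended"]),
    Step("Taken to police station", ["police station"]),
    Step("Charged", ["charged"]),
    Step("Incarcerated or bailed", ["jailed", "bailed"]),
])

# A hypothetical genre script for a call for papers, with assumed steps and cues.
CFP = Script("$CFP", [
    Step("Header: title", ["call for papers"]),
    Step("Abstract", ["abstract"]),
    Step("Topics (list)", ["topics"]),
    Step("Dates and submission", ["deadline", "submission"]),
])

story = ("John Doe was arrested last Saturday morning after holding up "
         "the New Haven Savings Bank")
arrest_hits = ARREST.matched_steps(story)  # ["Suspect apprehended"]
```

Structural genre features such as centred blocks or alignment would need cues that are not plain words. They are left out of this sketch.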

