
The Development of E2T and T2E Active Reading via Web

Asanee Kawtrakul and Teams
Kasetsart University, Bangkok, Thailand

[email protected]

Fifth Agricultural Ontology Service (AOS) Workshop
29 April 2004, Beijing, China

Outline
– Motivation
– Objectives
– System Overview
– Methodologies
– Example
– Conclusion and Future work

Acknowledgement
– KURDI: Kasetsart University Research and Development Institute

Collaboration
– Library Institute of Kasetsart University: provides the thesaurus and the agricultural corpus

Motivation
– Valuable data is scattered throughout the organization in multiple languages
– Good information is collected by many individuals in unstructured formats
– Digested information enables quicker decision-making

Proposed project
– Summarization
  – From unstructured to structured format
  – Only the gist of the information
– Translation
  – From English to Thai (E2T)
  – From Thai to English (T2E)

Objectives
– To develop a system for summarizing and translating agricultural information from English to Thai using statistical and frame-based approaches (E2T)
– To support the development of information discovery and web-based information exchange in the agricultural domain (T2E)

E2T

Summarization (Input)

Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada, Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.

Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb

Summarization (Cube)

Products  Country  Region             Year  Quantity
Egg       Canada   State: Ontario     1998  39.8%
Egg       Canada   State: Quebec      1998  16.6%
Egg       Canada   Western Provinces  1998  35.6%
Egg       Canada   Eastern Provinces  1998  8.0%

Other Output

[Figure: chart with Thai month labels (January through December) on one axis, a value axis from 0 to 8, and a legend for corn, soybean, and rice]

Some related works
– Frame
  – Knowledge representation in the form of slots and fillers
  – Consists of attributes and their values

Attributes  Values
Category    Paddy
Exporter    Thailand
Price       300
Unit        Dollars/Ton
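A slot-and-filler frame maps naturally onto a small record type. The following is a minimal Python sketch rather than the authors' implementation: the class name `Frame`, its methods, and the `TEMPLATE` list are hypothetical, while the slot names and values come from the paddy example.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A slot-and-filler frame: attribute names mapped to their values."""
    slots: dict = field(default_factory=dict)

    def fill(self, attribute, value):
        # Assign a filler (value) to a slot (attribute).
        self.slots[attribute] = value

    def unfilled(self, template):
        # Return the template slots that still have no filler.
        return [name for name in template if name not in self.slots]


# Hypothetical template listing the slots a price frame is expected to have.
TEMPLATE = ["Category", "Exporter", "Price", "Unit"]

frame = Frame()
frame.fill("Category", "Paddy")
frame.fill("Exporter", "Thailand")
frame.fill("Price", 300)
print(frame.unfilled(TEMPLATE))  # slots still waiting for a value
frame.fill("Unit", "Dollars/Ton")
print(frame.slots)
```

Keeping the template separate from the frame lets the same frame type serve different content domains.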

Methodologies
– Integration of NLP techniques and a data cube structure
  – The gist of the information is extracted and summarized as frames, then translated into the target language
  – The data cube structure supports efficient data access management and powerful decision making
– Focusing on the case
  – Agricultural summary articles, which have largely similar structures

Why are NLP techniques needed?
– NP analysis
  – To extract named entities for activating a frame
  – To enhance the performance of indexing
– Word sense disambiguation
  – Pound
    1) The basic monetary unit of the United Kingdom
    2) A unit of mass and weight

System Overview

[Diagram components: Internet, Gathering Module, Document Database, Indexing and Clustering Module, Summarization Module, Translation Module, Data Cube, Graphical User Interface]

Gathering Module

[Diagram components: Internet, Web Robot, Preprocessing, Document Database, Agricultural Papers’ Abstracts]

Indexing and Clustering Module

[Diagram components: Documents, Lexical Token Identification, Weight Computation, Phrase Extraction, Multi-level Indexing (Word, Phrase, and Concept), Document Classification (Statistical Method), Clusters of Documents]

Summarization Module

[Diagram components: Document, Sentence Filtering, Shallow Parsing, Sentence Structures, Frame Generation, Frames, Translation Templates (Depending on Content’s Domain), Data Cube, Knowledge Base: Frame, Thesaurus]

Summarization (Input)

Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada, Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.

Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb

Summarization (Filtering)

Sentences:
– In 1998, there were 1,216 registered commercial egg producers in Canada.
– Ontario produced 39.8% of all eggs in Canada.
– Quebec was second with 16.6%.
– The western provinces have a combined egg production of 35.6%.
– The eastern provinces have a combined production of 8.0%.
– Let us focus on Canada’s agricultural products.

Indices  Weight
Number   Very High
Egg-     High
%        High
Produc-  Medium
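The filtering step can be illustrated by scoring each sentence against the weighted indices. This is a rough sketch under stated assumptions, not the system's actual scoring: the numeric weights (3, 2, 1), the threshold, and the function names `score` and `filter_sentences` are all invented for illustration; only the four indices and their qualitative weights come from the slide.

```python
import re

# Assumed numeric stand-ins for the qualitative weights on the slide.
WEIGHTS = {"Very High": 3, "High": 2, "Medium": 1}

# Each index is a (test, weight) pair; the tests are simple heuristics.
INDICES = [
    (lambda s: re.search(r"\d", s) is not None, WEIGHTS["Very High"]),  # Number
    (lambda s: "egg" in s.lower(), WEIGHTS["High"]),                    # Egg-
    (lambda s: "%" in s, WEIGHTS["High"]),                              # %
    (lambda s: "produc" in s.lower(), WEIGHTS["Medium"]),               # Produc-
]


def score(sentence):
    # Sum the weights of every index that occurs in the sentence.
    return sum(weight for test, weight in INDICES if test(sentence))


def filter_sentences(sentences, threshold=3):
    # Keep sentences whose score reaches the (assumed) threshold.
    return [s for s in sentences if score(s) >= threshold]


SENTENCES = [
    "Let us focus on Canada's agricultural products.",
    "In 1998, there were 1,216 registered commercial egg producers in Canada.",
    "Ontario produced 39.8% of all eggs in Canada.",
    "Quebec was second with 16.6%.",
    "The western provinces have a combined egg production of 35.6%.",
    "The eastern provinces have a combined production of 8.0%.",
]

for kept in filter_sentences(SENTENCES):
    print(score(kept), kept)
```

With these assumed weights the introductory sentence matches only the "Produc-" index and falls below the threshold, while every sentence carrying a figure is kept.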

Summarization (Templates)

Product   X
Country   X
Region    X
Year      X
Quantity  X

Summarization (Frames)

Product   Egg
Country   Canada
Region    State: Ontario
Year      1998
Quantity  39.8%

Product   Egg
Country   Canada
Region    Eastern Provinces
Year      1998
Quantity  8.0%

Product   Egg
Country   Canada
Region    State: Quebec
Year      1998
Quantity  16.6%

Product   Egg
Country   Canada
Region    Western Provinces
Year      1998
Quantity  35.6%

Summarization (Cube)

Products  Country  Region             Year  Quantity
Egg       Canada   State: Ontario     1998  39.8%
Egg       Canada   State: Quebec      1998  16.6%
Egg       Canada   Western Provinces  1998  35.6%
Egg       Canada   Eastern Provinces  1998  8.0%
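One simple way to picture the step from frames to a cube is a mapping from dimension values to the measure. The sketch below assumes a plain dictionary keyed by (Product, Country, Region, Year); the names `build_cube` and `slice_cube` are hypothetical, and a real data cube would normally live in a database rather than in memory.

```python
DIMENSIONS = ("Product", "Country", "Region", "Year")

# The four frames from the egg-production example.
FRAMES = [
    {"Product": "Egg", "Country": "Canada", "Region": "State: Ontario",
     "Year": 1998, "Quantity": 39.8},
    {"Product": "Egg", "Country": "Canada", "Region": "State: Quebec",
     "Year": 1998, "Quantity": 16.6},
    {"Product": "Egg", "Country": "Canada", "Region": "Western Provinces",
     "Year": 1998, "Quantity": 35.6},
    {"Product": "Egg", "Country": "Canada", "Region": "Eastern Provinces",
     "Year": 1998, "Quantity": 8.0},
]


def build_cube(frames):
    # Map each tuple of dimension values to its measure (the quantity, in %).
    return {tuple(f[d] for d in DIMENSIONS): f["Quantity"] for f in frames}


def slice_cube(cube, **fixed):
    # Keep the cells whose dimension values match every keyword given.
    return {
        key: value
        for key, value in cube.items()
        if all(key[DIMENSIONS.index(name)] == wanted
               for name, wanted in fixed.items())
    }


cube = build_cube(FRAMES)
print(slice_cube(cube, Region="State: Quebec"))
print(round(sum(slice_cube(cube, Year=1998).values()), 1))
```

Slicing on one dimension, such as the region, is the kind of access a query front end would need.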

Translation Module

[Diagram components: User’s Query, Query Processing, Data Cube, Translation and Measurement Unit Conversion, Bilingual Dictionary and Thesaurus, Visualization Tool]

Translation Result

Category  Exporter  Year  Month     Price  Unit
Paddy     Thailand  2002  January   300    Dollars/Ton
Paddy     Thailand  2002  February  285    Dollars/Ton

ประเภท      ผู้ส่งออก    ปี    เดือน       ราคา    หน่วย
ข้าวเปลือก  ประเทศไทย  2545  มกราคม     14,340  บาทต่อเกวียน
ข้าวเปลือก  ประเทศไทย  2545  กุมภาพันธ์  13,625  บาทต่อเกวียน

(The Thai table gives the same rows in Thai: paddy, Thailand, Buddhist Era year 2545, January and February, with prices in baht per kwian.)
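The translation result combines a dictionary lookup with year and unit conversion. The sketch below is illustrative only: the exchange rate is simply the one implied by the January row (14,340 / 300), one kwian of paddy is assumed to equal one ton, and the names `translate_row` and `THAI` are hypothetical. With this assumed rate the February row comes out close to, but not exactly at, the slide's 13,625, so the original system presumably used a different rate or rounding.

```python
BUDDHIST_ERA_OFFSET = 543  # Thai Buddhist Era year = Gregorian year + 543

# Assumptions for illustration only (see the note above).
BAHT_PER_DOLLAR = 47.8
TONS_PER_KWIAN = 1.0

# Hypothetical bilingual dictionary covering the example rows.
THAI = {
    "Paddy": "ข้าวเปลือก",
    "Thailand": "ประเทศไทย",
    "January": "มกราคม",
    "February": "กุมภาพันธ์",
    "Dollars/Ton": "บาทต่อเกวียน",  # the unit changes together with the price
}


def to_buddhist_year(year):
    return year + BUDDHIST_ERA_OFFSET


def dollars_per_ton_to_baht_per_kwian(price):
    return round(price * BAHT_PER_DOLLAR * TONS_PER_KWIAN)


def translate_row(category, exporter, year, month, price, unit):
    # Translate the text fields and convert the year, price, and unit.
    return (
        THAI[category],
        THAI[exporter],
        to_buddhist_year(year),
        THAI[month],
        dollars_per_ton_to_baht_per_kwian(price),
        THAI[unit],
    )


print(translate_row("Paddy", "Thailand", 2002, "January", 300, "Dollars/Ton"))
```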

Web-based User Interface
– To make inquiries about the price history of agricultural products, including chronological and statistical data

Output

[Figure: chart with Thai month labels (January through December) on one axis, a value axis from 0 to 8, and a legend for corn, soybean, and rice]

Current state of E2T: the system
– Parser: shallow parsing
– English to Thai
– Summarization and translation: frame-based
– Text to relational database

Parser

SL Analysis
  Big dog loves small cat.
  [S [np [adj big] [n dog]] [vp [v loves] [np [adj small] [n cat]]]]

Transfer

TL Generation
  [S [np [n สุนัข] [adj ใหญ่]] [vp [v รัก] [np [n แมว] [adj เล็ก]]]]
  สุนัข ใหญ่ รัก แมว เล็ก
  /sulnakh yail rakh määwm lekh/
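The analysis, transfer, and generation steps for this sentence can be sketched in a few lines. This is a toy illustration, not the system's parser: the tree is written by hand instead of being produced by shallow parsing, the lexicon covers only these five words, and the function names `transfer` and `leaves` are hypothetical.

```python
# Hypothetical bilingual lexicon covering only the example sentence.
LEXICON = {"big": "ใหญ่", "dog": "สุนัข", "loves": "รัก",
           "small": "เล็ก", "cat": "แมว"}


def transfer(node):
    """Reorder an English parse tree into Thai order and translate its leaves."""
    label, children = node
    if isinstance(children, str):
        # Leaf: (part of speech, word) -> translate the word.
        return (label, LEXICON[children.lower()])
    children = [transfer(child) for child in children]
    if label == "np":
        # English puts adjectives before the noun; Thai puts them after it.
        nouns = [c for c in children if c[0] != "adj"]
        adjectives = [c for c in children if c[0] == "adj"]
        children = nouns + adjectives
    return (label, children)


def leaves(node):
    # Read the words off the tree from left to right (TL generation).
    label, children = node
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


english = ("S", [
    ("np", [("adj", "Big"), ("n", "dog")]),
    ("vp", [("v", "loves"), ("np", [("adj", "small"), ("n", "cat")])]),
])

thai = transfer(english)
print(" ".join(leaves(thai)))
```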

T2E

Input and Output
– Input characteristics (SL)
  – Web pages must be HTML files only
  – Web pages are displayed in Thai
– Output characteristics (TL)
  – The system displays the output in English in a new pop-up window

Why translate only tables?

From the survey, agricultural web pages can be divided into 3 types:
– Full text
– Tables with contexts
– Tables only (approx. 50%)

Table Characteristics (cont.)

[Screenshot callouts: Numeric, Heading (Outside Table), Pure Texts, Unit]

Table Characteristics (cont.)

[Screenshot callouts: Unit outside table, Unit inside table]

Input Format Example
– Input as Frame format
  – Department of Internal Trade (DIT)
  – Office of Agriculture Economics (OAE)

Agricultural Economics News

[Screenshot callouts: Tables only, Bullet, Picture]

System overview

[Diagram stages, in the order shown: Pages, Table Analysis, Chunk-level Translation, Unit Conversion, Output Generation, Output. Supporting resources: Dictionary & Grammar Rules, Conversion Table]

Input Webpage

[Diagram components: Internet, Web Robot, HTML File]

Table Analysis

[Diagram components: HTML File, HTML Parser, Tag with position anchor, Text with position anchor]

Position Anchor (Table Analysis)

Letters are used to stand for the position of the data in each table cell:
– T stands for ‘table’
– R stands for ‘row’
– C stands for ‘column’

Keyword Definition Example (Table Analysis)

Table 1:
ข้าว        1999    2000
ประเทศไทย  24,245  28,356

Table 2:
ข้าวโพด     1999
ประเทศไทย  2,172,000

(ข้าว = rice, ข้าวโพด = corn, ประเทศไทย = Thailand)

The result will be:

T1R1C1 ^ ข้าว
T1R1C2 ^ 1999
T1R1C3 ^ 2000
T1R2C1 ^ ประเทศไทย
T1R2C2 ^ 24,245
T1R2C3 ^ 28,356

T2R1C1 ^ ข้าวโพด
T2R1C2 ^ 1999
T2R2C1 ^ ประเทศไทย
T2R2C2 ^ 2,172,000
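Position anchors of this form can be produced with a small HTML parser. The following is a minimal sketch using Python's standard `html.parser` module, not the system's own parser; the class name `AnchorParser`, the function `anchor_cells`, and the sample markup in `PAGE` are assumptions, and nested tables or `rowspan`/`colspan` attributes are not handled.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collect 'TxRyCz ^ text' anchors for every table cell."""

    def __init__(self):
        super().__init__()
        self.table = self.row = self.col = 0
        self.in_cell = False
        self.text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1      # new table: restart the row counter
            self.row = 0
        elif tag == "tr":
            self.row += 1        # new row: restart the column counter
            self.col = 0
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
            self.text = []

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            cell = "".join(self.text).strip()
            self.anchors.append(f"T{self.table}R{self.row}C{self.col} ^ {cell}")


def anchor_cells(html):
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Assumed markup for the two example tables.
PAGE = """
<table><tr><td>ข้าว</td><td>1999</td><td>2000</td></tr>
<tr><td>ประเทศไทย</td><td>24,245</td><td>28,356</td></tr></table>
<table><tr><td>ข้าวโพด</td><td>1999</td></tr>
<tr><td>ประเทศไทย</td><td>2,172,000</td></tr></table>
"""

for anchor in anchor_cells(PAGE):
    print(anchor)
```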

Chunk-level Translation

[Diagram components: Text with Keyword, Phrase Chunker & NE Extraction, Dictionary & Grammar Rules, Translated File]

Phrase Chunker (cont.) (Chunk-level Translation)

Rules:
  np → n+ vp
  vp → aux? v n

Example:
  1: ราคา/n นำเข้า/v สินค้า/n
  2: vp
  3: np

Phrase Chunker (Chunk-level Translation)

Chunk-level Translation (cont.)

Handling named entities (NE):
– An NE cannot be translated word by word
– e.g. กองควบคุมพืชและวัสดุการเกษตร
  – Chunker: AGRICULTURAL PLANT AND MATERIAL CONTROL DIVISION
  – NE Extraction: AGRICULTURAL REGULATORY DIVISION

Table Characteristics (Unit Conversion)

[Screenshot callouts: 1. Unit outside table, 2. Unit inside table]

Unit conversion (cont.)

Sentence Generation

Rules:
  np → n+ vp
  vp → aux? v+ n

Example:
  1: ราคา/n นำเข้า/v สินค้า/n
  2: vp
  3: np

Sentence Generation (cont.)

Transfer rules:
  Thai          English
  np → n+ vp    np → adjp n+
  vp → v+ n     adjp → adj* | np

Example:
  [NP ราคา [vp นำเข้า สินค้า]]
  [NP [np สินค้า นำเข้า] ราคา]
  [NP [np goods importing] ราคา]
  [NP [np goods importing] price]
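The reordering in this example can be written as one small transfer function followed by a dictionary lookup. This sketch hard-codes the single pattern from the slide (a Thai noun phrase whose head nouns are followed by a verb phrase); the function names and the three-word lexicon are hypothetical, and a real rule set would cover many more patterns.

```python
# Hypothetical lexicon for the example; the verb maps to the modifier form
# that appears on the slide ("importing").
LEXICON = {"ราคา": "price", "นำเข้า": "importing", "สินค้า": "goods"}


def transfer_np(heads, vp):
    """Thai [NP n+ [vp v+ n]] -> English-order [NP [np n v+] n+]."""
    verbs, obj = vp
    # The Thai verb phrase follows the head noun; in English it becomes a
    # pre-modifier, with the object noun moved in front of the verb.
    modifier = [obj] + list(verbs)
    return modifier + list(heads)


def translate(words):
    # Word-level lookup, applied after the structure has been reordered.
    return [LEXICON[word] for word in words]


reordered = transfer_np(["ราคา"], (["นำเข้า"], "สินค้า"))
print(" ".join(reordered))
print(" ".join(translate(reordered)))
```

Reordering before lookup mirrors the order of the steps on the slide, where the structure changes first and the words are replaced afterwards.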

Result

[Screenshot: ActiveReading]

Available Web sites
– Department of Internal Trade: http://www.dit.go.th/
– Office of the Rubber Replanting Aid Fund: http://www.thailandrubber.thaigov.net/menu5.php
– http://www.talaadthai.com/pricebase/default.asp
– http://www.rubberthai.com/price/price_index.htm
– http://www.thaifruitnews.com/

Multilinguality Extension

Structure of ML-Dictionary (New version)
– Main language: English (vocabulary and POS)
– A separate table for each language
– Vocabulary entries that have the same meaning are linked together by an ID attribute
– 10 supported languages: Bahasa Indonesia, Chinese, English, French, Italian, Japanese, Korean, Tagalog, Thai and Vietnamese
– UTF-8 character encoding
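A dictionary organized this way can be sketched with one table per language, joined on a shared ID. The example below uses Python's built-in `sqlite3` module and is only an assumption about the layout: the table and column names, the sample entries, and the `lookup` function are hypothetical, and only two of the ten languages are created for brevity.

```python
import sqlite3

# Subset of the supported languages, for brevity; names are hypothetical.
LANGUAGES = ("thai", "french")

conn = sqlite3.connect(":memory:")

# English is the main language and carries the part of speech.
conn.execute("CREATE TABLE english (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
for lang in LANGUAGES:
    # One table per language; 'id' links each entry to its English meaning.
    conn.execute(
        f"CREATE TABLE {lang} (id INTEGER REFERENCES english(id), word TEXT)"
    )

# Sample entries invented for illustration.
conn.execute("INSERT INTO english VALUES (1, 'rice', 'n')")
conn.execute("INSERT INTO thai VALUES (1, 'ข้าว')")
conn.execute("INSERT INTO french VALUES (1, 'riz')")


def lookup(english_word, language):
    # Only whitelisted table names are interpolated into the query.
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    row = conn.execute(
        f"SELECT t.word FROM english e JOIN {language} t ON t.id = e.id "
        "WHERE e.word = ?",
        (english_word,),
    ).fetchone()
    return row[0] if row else None


print(lookup("rice", "thai"))
print(lookup("rice", "french"))
```

Adding a language then means adding one table, without touching the existing ones.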

User Interface example
– User interface for adding new vocabulary

User Interface example (cont.)
– User interface for querying vocabulary

Current result based on FAO stat
– English: 23,207 vocabulary entries
– French: 1,482 vocabulary entries
– Thai: 23,097 vocabulary entries
– Vietnamese: 175 vocabulary entries
– Japanese: 108 vocabulary entries
– Bahasa Indonesia: 13 vocabulary entries
– Chinese, Italian, Korean and Tagalog: 0 vocabulary entries

Future work
– Web-based Multilingual Active Reading System for Information Exchange
  – Language configuration
  – Active Reading assistant
  – Table translator with more multilingual dictionaries

Thank you

