Natural Language Processing based on and for Information Explosion on the Web Sadao Kurohashi Kyoto University / NICT (TCS NLP Winter School 2008, 2008/1/5, IIIT, Hyderabad, India)
Page 1: Q - LTRC - IIIT Hyderabad

Natural Language Processing based on and for Information Explosion on the Web

Sadao Kurohashi, Kyoto University / NICT

(TCS NLP Winter School 2008, 2008/1/5, IIIT, Hyderabad, India)

Page 2: Q - LTRC - IIIT Hyderabad

Search

• The web influences:
  – people's daily life
  – enterprise management
  – governmental policy decisions
• 75% of people would rather use the web to answer their questions than ask their own family members.
• Service-industry workers spend 30% of their time on search.
• 50% of complex queries go unanswered.

Page 3: Q - LTRC - IIIT Hyderabad

High-Performance Computing Environment

800 CPU cores, 100 TB storage

Page 4: Q - LTRC - IIIT Hyderabad

[Diagram: deep NLP analysis of the example text]

ミンククジラの数は増えている 問題はミンク鯨だ。絶滅しかかっている
(minke whale number increasing / the problem is the minke whale. it faces extinction)

• Word segmentation and identification: ミンク鯨 の / 数 は / 増えている …
• Predicate-argument structure analysis: ミンク鯨 の 数 は 増えている; 問題 は ミンク鯨 だ; 絶滅しかかっている
• Anaphora resolution
• Flexible matching: ミンククジラ = ミンク鯨
• Conflict detection: 「ミンク鯨の数が減っている」 (decreasing) vs. 「増えている」 (increasing)

Deep NLP ⇒ Information Credibility

Page 5: Q - LTRC - IIIT Hyderabad

NLP based on Information Explosion on the Web

NLP for Information Explosion on the Web

• Compilation of a basic lexicon and robust morphological analysis

• Case frame acquisition and predicate-argument structure analysis

• Synonymous expression acquisition and flexible matching

• Open search engine infrastructure
• Information organization system
• Information credibility analysis system

Page 6: Q - LTRC - IIIT Hyderabad

Japanese

サッカーのカメルーン代表が、ケニアで大統領選をめぐり暴動が発生していることを受け、アフリカ選手権(20日開幕、ガーナ)に備えて同国内で行う予定だった10日間の練習合宿を取りやめたことが2日、分かった。AFP通信が伝えた。合宿中に予定されていたケニア代表との強化試合も中止となった。

(It was learned on the 2nd that Cameroon's national soccer team canceled the 10-day training camp it had planned to hold in Kenya in preparation for the African Cup of Nations (opening on the 20th, in Ghana), in response to the riots that have broken out in Kenya over the presidential election. AFP reported the news. A practice match against the Kenyan national team planned during the camp was also canceled.)

http://www.asahi.com/sports/update/0103/JJT200801030002.html

Page 7: Q - LTRC - IIIT Hyderabad

Characteristics of Japanese

• No space between words ⇒ Segmentation
• Four sets of letters ⇒ Synonym:

– HIRAGANA e.g., いんど

– KATAKANA e.g., インド

– Chinese characters e.g., 印度(KANJI)

– English alphabet e.g., India

Page 8: Q - LTRC - IIIT Hyderabad

a. Head final
b. Free word order
c. Postpositions function as case markers
d. Hidden case markers
e. Omission of case components

Characteristics of Japanese

Page 9: Q - LTRC - IIIT Hyderabad

a. Head final
b. Free word order
c. Postpositions function as case markers

Characteristics of Japanese

Kare-ga   Deutsch-go-wo   hanasu.
he-NOM    German-ACC      speak
(He speaks German.)

Page 10: Q - LTRC - IIIT Hyderabad

d. Hidden case markers

Characteristics of Japanese

Kare-wa   Deutsch-go-wo   hanasu.      (wa: topic marker; it hides ga? or wo?)
he-TOP    German-ACC      speak
(He speaks German.)

Deutsch-go-wo   hanasu   sensei …      (the relative gap corresponds to ga)
German-ACC      speak    teacher
(the teacher who speaks German)

Page 11: Q - LTRC - IIIT Hyderabad

Characteristics of Japanese

e. Omission of case components

φ-ga   Deutsch-go-wo   hanasu   sensei-wo    yatotta.
φ-NOM  German-ACC      speak    teacher-ACC  hired
(φ hired a teacher who speaks German.)

Page 12: Q - LTRC - IIIT Hyderabad

Compilation of a basic lexicon and robust morphological analysis

Page 13: Q - LTRC - IIIT Hyderabad

Basic Lexicon

• Dictionaries for humans: 200,000 entries
• EDR: 200,000 entries
→ side effects for segmentation; hard to maintain

⇒ 30,000 words (97% coverage of news texts)

Page 14: Q - LTRC - IIIT Hyderabad

Spelling Variation

かに (hiragana), カニ (katakana), 蟹 (kanji)  (crab)

→ representative form (ID): 蟹/かに

Page 15: Q - LTRC - IIIT Hyderabad

Spelling Variation

落ちる, 落る, おちる (drop) → 落ちる/おちる
綺麗だ, 奇麗だ, きれいだ (beautiful) → 綺麗だ/きれいだ
子供, 子ども, こども (child) → 子供/こども
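The representative-form mapping amounts to a lookup table from each surface variant to one ID. A minimal sketch, with entries taken from the examples above (the real basic lexicon stores such IDs for its 30,000 entries):

```python
# Representative-form table; entries taken from the slides above.
REPRESENTATIVE = {
    "かに": "蟹/かに", "カニ": "蟹/かに", "蟹": "蟹/かに",
    "落ちる": "落ちる/おちる", "落る": "落ちる/おちる", "おちる": "落ちる/おちる",
    "子供": "子供/こども", "子ども": "子供/こども", "こども": "子供/こども",
}

def normalize(word):
    """Map a surface spelling variant to its representative form (ID)."""
    return REPRESENTATIVE.get(word, word)
```

Indexing and matching can then operate on the IDs instead of the raw spellings.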

Page 17: Q - LTRC - IIIT Hyderabad

Other Information for Basic Lexicon

• Possibility form
  – 書ける (can write) → 書く (write)
• Honorific form
  – 召し上がる (eat, honorific) → 食べる (eat)
• Category (22 classes, e.g., <human>, <organization>, …)
• Domain (12 classes, e.g., <business>, <education>, …)

Page 18: Q - LTRC - IIIT Hyderabad

Robust Morphological Analysis

上海ガニを ばくばく 食べた
Shanghai crab-ACC in big mouthfuls ate

ばくばく (BAKU-BAKU): onomatopoeia
ガニ (GANI) ⇔ カニ (KANI): sequential voicing

Page 19: Q - LTRC - IIIT Hyderabad

Case frame acquisition and predicate-argument structure analysis

Page 20: Q - LTRC - IIIT Hyderabad

Language Understanding and Common sense

Mary ate the salad with a fork
Mary ate the salad with mushrooms

クロールで 泳いでいる 女の子を 見た
crawl-with swimming girl-ACC saw
望遠鏡で 泳いでいる 女の子を 見た
telescope-with swimming girl-ACC saw

Case frames:
泳ぐ (swim): {人 person, 子 child, …} ga; {クロール crawl, 平泳ぎ breaststroke, …} de; {海 sea, 大海 ocean, …} wo
見る (see): {人 person, 者 person, …} ga; {望遠鏡 telescope, 双眼鏡 binoculars, …} de; {姿 figure, 人 person, …} wo

Page 21: Q - LTRC - IIIT Hyderabad

WEB → 500M sentences (20M pages) → Parsing (KNP) and filtering → predicate-argument structures → Clustering → Case frames for 90K predicates

Accuracy: 86.7% for all PAs; 97.3% for the most reliable 18.1% of PAs; 86.7% → 87.4%

[Kawahara and Kurohashi, HLT2001, COLING2002, LREC2006]

Page 22: Q - LTRC - IIIT Hyderabad

PC Clusters (350 CPUs)

Page 23: Q - LTRC - IIIT Hyderabad

The same pipeline, with processing times: parsing (KNP) and filtering: 1 day; clustering: 7 days.

Page 24: Q - LTRC - IIIT Hyderabad

Building a web corpus
1. Crawl the web
2. Extract Japanese page candidates using encoding information
   • charset attribute, perl Encode::guess_encoding()
3. Judge Japanese pages using linguistic information (20M pages)
   • Japanese postpositions (ga, wo, ni, …) > 0.5%
4. Extract sentences from each page
5. Extract Japanese sentences
   • HIRAGANA, KATAKANA, KANJI > 60%
6. Delete duplicate sentences
→ 500M Japanese sentences (Japanese: 995 / 1,000)
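Steps 3 to 6 can be sketched as follows. This is a toy approximation under stated assumptions: the thresholds are the ones quoted above, but the postposition list and the sentence splitter are simplifications of what the pipeline actually used:

```python
import re

# Hiragana, katakana, and kanji Unicode ranges (step 5 test).
JP_CHAR = re.compile(r'[\u3041-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')
# A few frequent postpositions (step 3 test); "..." in the slide.
POSTPOSITIONS = ('が', 'を', 'に', 'は', 'で')

def is_japanese_page(text, threshold=0.005):
    """Step 3: judge a page Japanese if postpositions exceed 0.5% of characters."""
    if not text:
        return False
    count = sum(text.count(p) for p in POSTPOSITIONS)
    return count / len(text) > threshold

def is_japanese_sentence(sent, threshold=0.6):
    """Step 5: keep a sentence if >60% of its characters are hiragana/katakana/kanji."""
    if not sent:
        return False
    jp = sum(1 for ch in sent if JP_CHAR.match(ch))
    return jp / len(sent) > threshold

def build_corpus(pages):
    """Steps 4-6: split into sentences, filter, delete duplicates (order kept)."""
    seen, corpus = set(), []
    for page in pages:
        if not is_japanese_page(page):
            continue
        for sent in re.split(r'(?<=。)', page):   # naive splitter on 。
            sent = sent.strip()
            if sent and is_japanese_sentence(sent) and sent not in seen:
                seen.add(sent)
                corpus.append(sent)
    return corpus
```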

Page 25: Q - LTRC - IIIT Hyderabad

もれなくプレゼント! (A present for everyone!)
でも僕はTシャツの上に長袖のシャツ。 (But I wear a long-sleeved shirt over a T-shirt.)
今回は某アイドルの高橋一也も参加したので客が若い。 (Since Kazuya Takahashi, a certain idol, joined this time, the audience was young.)
団体Aが「まちづくり」をテーマにインターネット上で公開講座を開催しようとしている。 (Organization A is trying to hold an open class about "city planning" on the Internet.)
htaccessを置いたとたんそのディレクトリ以下で. (As soon as you put htaccess, under that directory.)
昨年の没後400年祭を機に復元した井戸を紹介する木下さん (Mr. Kinoshita introducing a well restored last year on the occasion of the 400th anniversary of the death)
恋は、真剣勝負。 (Love is a game played in earnest.)
ほめ言葉が多くって嬉しいですね。 (I'm glad to receive many compliments.)
いまだに言うでしょう。 (You still say that.)
「買いパラ」を見たと伝えれば、お買い上げ合計金額より5%引きいたします。 (If you mention that you saw "Kaipara", we offer a 5% discount on your total purchase.)
政治も危機的状況ですし、物資も不足しています。 (Politics is in a critical state, and commodities are scarce.)
思いやりのある優しい子に育ってネ。 (Grow up to be a considerate and kind child.)

Page 26: Q - LTRC - IIIT Hyderabad

Compiling case frames from the web corpus

• Collect reliable parse results (predicate-argument structures) from the web corpus
  Accuracy: 86.7% (all) → 97.3% (the reliable 18.1%)
• Semantic ambiguity, scrambling, omission → a verb and its closest argument are coupled

望遠鏡で 泳いでいる 女の子を 見た
telescope-with swimming girl-ACC saw
泳いでいる 女の子を 望遠鏡で 見た
swimming girl-ACC telescope-with saw
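The closest-argument heuristic can be sketched as follows. This is a toy illustration, not the actual KNP-based implementation: a clause is given as case-marked argument chunks followed by the predicate, and only the argument adjacent to the predicate is trusted as a cue:

```python
def couple_closest(chunks):
    """Given a clause as (word, case) argument chunks followed by the predicate
    string, return (closest_argument, case, predicate).  The argument adjacent
    to the predicate is the most reliable cue for the predicate's sense."""
    *args, predicate = chunks
    if not args:
        return None
    word, case = args[-1]
    return (word, case, predicate)
```

For the two word orders above, the coupled pair differs: (女の子, wo, 見た) in the first, (望遠鏡, de, 見た) in the second.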

Page 27: Q - LTRC - IIIT Hyderabad

[Diagram: predicate-argument examples collected from the web for the verb tsumu]

jugyoin-ga (worker-NOM) kuruma-ni (car-DAT) nimotsu-wo (baggage-ACC) tsumu (load);
busshi-wo (supply-ACC) hikouki-ni (airplane-DAT) tsumu (load);
truck-ni (truck-DAT) nimotsu-wo (baggage-ACC) tsumu (load);
sagyosya-ga (operator-NOM) … tsumu (load);
kare-ga (he-NOM) keiken-wo (experience-ACC) tsumu (accumulate);
sensyu-ga (player-NOM) keiken-wo (experience-ACC) tsumu (accumulate)

These instances are clustered into the two senses of tsumu: "load" (ga: worker, operator; ni: car, truck, airplane; wo: baggage, supply) and "accumulate" (ga: he, player; wo: experience).


Page 30: Q - LTRC - IIIT Hyderabad

Case frame examples

CS: examples

yaku (1) (bake):
  ga: I:18, person:15, craftsman:10, …
  wo: bread:2484, meat:1521, cake:1283, …
  de: oven:1630, frying pan:1311, …
yaku (2) (have difficulty):
  ga: teacher:3, government:3, person:3, …
  ni: attack:18, action:15, son:15, …
  wo: hand:2950
yaku (3) (copy):
  ga: maker:1, distributor:1, …
  wo: data:178, file:107, copy:9, …
  ni: R:1583, CD:664, CDR:3, …

Page 31: Q - LTRC - IIIT Hyderabad

Statistics of the acquired case frames

(news / web)
# of predicates: 18,246 / 89,243
  verb: 12,641 / 40,860
  adjective: 991 / 4,121
  noun+copula: 4,614 / 44,262
Average # of case frames for a verb: 17.5 / 34.3
Average # of CS (case slots) per case frame: 2.4 / 3.2
Average # of examples per CS: 29.8 / 72.9
Average # of unique examples per CS: 4.2 / 26.9

Page 32: Q - LTRC - IIIT Hyderabad

[Graph: coverage (bi-lexical dependency) vs. corpus size (31M, 62M, 125M, 250M, 500M, 1G sentences); y-axis 0-100%; ●: similar match, ■: exact match]

(cf. Penn treebank based lexical parser: 1.5% [Bikel 04])

Page 33: Q - LTRC - IIIT Hyderabad

Case frame search is available

Page 34: Q - LTRC - IIIT Hyderabad

Related Work

• Subcategorization frame acquisition[Brent, 1993] [Ushioda et al., 1993] [Manning, 1993] [Briscoe and Carroll, 1997]…

• FrameNet [Baker et al., 1998]

• PropBank [Palmer et al., 2005]

• Unsupervised learning for English [McClosky et al., 2006]

Page 35: Q - LTRC - IIIT Hyderabad

Integrated probabilistic model for syntactic and case structure analysis

[Kawahara and Kurohashi, HLT-NAACL2006]

Page 36: Q - LTRC - IIIT Hyderabad

bangohan-wa tabe-te kaet-ta (dinner-wa eat-te go_home-ta)

Parse A (dinner-wa depends on eat-te):
  P(go_home-ta | EOS) × P(dinner-wa eat-te | go_home-ta) = 0.005 × 0.002
Parse B (dinner-wa depends on go_home-ta):
  P(dinner-wa go_home-ta | EOS) × P(eat-te | go_home-ta) = 0.003 × 0.000001

⇒ Parse A > Parse B

Page 37: Q - LTRC - IIIT Hyderabad

Integrated model for syntactic and case structure analysis

Input sentence S, dependency structure T, case structure L:

(T_best, L_best) = argmax_{(T,L)} P(T,L | S)
                 = argmax_{(T,L)} P(T,L,S) / P(S)
                 = argmax_{(T,L)} P(T,L,S)

P(T,L,S) =def ∏_{b_i ∈ T} P(C_i | C_{h_i})

(C_i: the clause headed by bunsetsu b_i; C_{h_i}: the clause of its head b_{h_i})

Page 38: Q - LTRC - IIIT Hyderabad

argmax_{(T,L)} P(T,L,S) = argmax_{(T,L)} ∏_{b_i ∈ T} P(C_i | C_{h_i})

Parse A (dinner-wa depends on eat-te):
  P(CS(go_home) | ta, EOS) × P(ta | EOS)
  × P(CS(dinner-wa eat) | te, go_home) × P(te | ta)
≈ P(go_home-ta | EOS) × P(dinner-wa eat-te | go_home-ta)

Parse B (dinner-wa depends on go_home-ta):
  P(CS(dinner-wa go_home) | ta, EOS) × P(ta | EOS)
  × P(CS(eat) | te, go_home) × P(te | ta)
≈ P(dinner-wa go_home-ta | EOS) × P(eat-te | go_home-ta)

For each case structure, the candidate case frames of the predicate and the possible case assignments are enumerated, e.g. for eat:
  eat1 (ga, wo): dinner → ga, or dinner → wo
  eat2 (ga, wo, ni): dinner → ga, dinner → wo, or dinner → ni
  …

Page 39: Q - LTRC - IIIT Hyderabad

argmax_{(T,L)} P(T,L,S) = argmax_{(T,L)} ∏_{b_i ∈ T} P(CS_i | f_i, w_{h_i}) × P(f_i | f_{h_i})

Generative probability of case structure:

P(CS_i | f_i, w_{h_i}) = P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i)

– P(v_i | w_{h_i}): probability of generating the predicate v_i, e.g., P(eat | go_home)
– P(CF_l | v_i): probability of generating the case frame CF_l from the predicate v_i, e.g., P(CF_eat1 | eat)
– P(CA_k | CF_l, f_i): probability of generating the case assignment CA_k from the case frame CF_l

Example: case frame CF_eat1 has ga: {person, student, …} and wo: {dinner, lunch, …}; in the case assignment CA_k, dinner-wa corresponds to the wo slot, and the ga slot has no correspondence.

Page 40: Q - LTRC - IIIT Hyderabad

Generative probability of case assignment

P(CA_k | CF_l, f_i) = ∏_{s_j : A(s_j)=1} P(A(s_j)=1, n_j, f_j | CF_l, f_i, s_j)
                    × ∏_{s_j : A(s_j)=0} P(A(s_j)=0 | CF_l, f_i, s_j)

(n_j: content word; f_j: type; s_j: case slot)

Example (dinner-wa eat-te, case frame CF_eat1):
  P(A(wo)=1, dinner, wa | CF_eat1, te, wo) × P(A(ga)=0 | CF_eat1, te, ga)

Putting everything together:

argmax_{(T,L)} P(T,L,S)
  = argmax_{(T,L)} ∏_{b_i ∈ T} P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i) × P(f_i | f_{h_i})

Page 41: Q - LTRC - IIIT Hyderabad

Resources for parameter estimation

Probability : what is generated : resource

– P(v_i | w_{h_i}) : predicate : parse results (words)
– P(CF_l | v_i) : case frame : case frames
– P(A(s_j)={0,1} | CF_l, s_j) : case slot : CS analysis results
– P(n_j | CF_l, A(s_j)=1, s_j) : case slot example : case frames, CS analysis results
– P(c_j | s_j) : surface case : Kyoto Text Corpus
– P(t_{ij} | f_j, p_j) : topic marker : Kyoto Text Corpus
– P(p_j | f_j) : punctuation mark : Kyoto Text Corpus
– P(p_i, u_i | p_{h_i}, u_{h_i}, o_{h_i}) : predicate type : Kyoto Text Corpus

(Kyoto Text Corpus: supervised; web parse and CS analysis results: unsupervised)

Page 42: Q - LTRC - IIIT Hyderabad

Experiments

• Resources for parameter estimation
  – Case frames: constructed from 500M web sentences
  – Parse and CS analysis results: analysis results of 6M web sentences
• Experiment on syntactic structure (675 web sentences)
  – Evaluate the head of each bunsetsu (except the last and second-to-last bunsetsu)
• Experiment on case structure (215 web sentences)
  – Evaluate the case interpretation of TM phrases (~wa) and clausal modifiees

Page 43: Q - LTRC - IIIT Hyderabad

Experimental results

Dependency structure (our method vs. mere parsing):
  all: 0.874 (3477/3976) vs. 0.867 (3447/3976)
  NB→VB: 0.858 (1328/1547) vs. 0.847 (1310/1547)
  TM (~wa): 0.812 (242/298) vs. 0.819 (244/298)
  others: 0.869 (1086/1249) vs. 0.853 (1066/1249)
  NB→NB: 0.946 (526/556) vs. 0.944 (525/556)
  VB→VB: 0.791 (601/760) vs. 0.780 (593/760)
  VB→NB: 0.920 (457/497) vs. 0.911 (453/497)

Case structure (our method vs. similarity-based baseline):
  TM phrase: 0.781 (82/105) vs. 0.686 (72/105)
  Clausal modifiee: 0.781 (121/155) vs. 0.690 (107/155)

Page 44: Q - LTRC - IIIT Hyderabad

Improved examples

水が 高い ところから 低い ところへ 流れる。
water-NOM high place-from low place-to flow
(Water flows from a high place to a low place.)

すぐに 標識用の エビを 同港に 停泊した 当港所属調査船「おやしお丸」に 搬送し、…
soon for-marking shrimp-ACC same port-at anchored this port's investigation ship "Oyashiomaru"-to transfer
(The shrimp for marking were soon transferred to the investigation ship "Oyashiomaru", belonging to this port, which had anchored at the same port, …)

Page 45: Q - LTRC - IIIT Hyderabad

Synonymous expression acquisition and flexible matching

Page 46: Q - LTRC - IIIT Hyderabad

Flexible Matching

• There are many expressions that convey almost the same meaning
  – a source of great difficulty in many NLP tasks

• Automatic extraction of synonymous expressions from a dictionary and a Web corpus [Shibata et al. IJCNLP2008]

• Flexible matching using SYNGRAPH data structure [Shibata et al. IJCNLP2008]

Page 47: Q - LTRC - IIIT Hyderabad

Automatic Acquisition of Synonymous Expressions

[Diagram: sources of synonymous expressions]

• Dictionary: definition patterns, e.g., buy = purchase, dog = spy, husband = she
• Web: parenthetic expressions, e.g., BSE = bovine spongiform encephalitis
• Web: distributional similarity (used to score candidate pairs)

Page 48: Q - LTRC - IIIT Hyderabad

Synonym and Hypernym Extraction from a Dictionary

• Using the definition sentence patterns
  – Hypernym: dinner = yugata (evening) no (of) syokuji (meal) → syokuji (meal)
  – Synonym: ice = "ice cream" no (of) ryaku (abbreviation); purchase = kau (buy) koto (matter) (one phrase)
• Wide coverage, but includes exceptional or idiosyncratic usages
  – dog: sense 1/2 → animal
  – dog: sense 2/2 = spy
  – tap water: sense 2/2 = strait

Page 49: Q - LTRC - IIIT Hyderabad

Distributional Similarity

• "Two terms are similar if they appear in similar contexts": if two terms have similar co-occurring words, the two terms are similar
• The distributional similarity is calculated from a web corpus (500M sentences)
  – co-occurrence in dependency relations
  – a co-occurring word is kept if its PMI (pointwise mutual information) is positive
  – similarity is defined as the overlap of the co-occurring words, calculated with the Simpson coefficient
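Both measures are easy to state in code. A minimal sketch from raw counts (the real system computes these over dependency-relation co-occurrences in the 500M-sentence corpus):

```python
import math

def pmi(cooc, total, freq_x, freq_y):
    """Pointwise mutual information log(P(x,y) / (P(x)P(y))) from raw counts."""
    p_xy = cooc / total
    p_x, p_y = freq_x / total, freq_y / total
    return math.log(p_xy / (p_x * p_y))

def simpson(set_a, set_b):
    """Simpson (overlap) coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

Each term's context set contains its PMI-positive co-occurring words; two terms are similar when the Simpson coefficient of their context sets is high.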

Page 50: Q - LTRC - IIIT Hyderabad

Example: co-occurring words and similar words of "doctor"

Co-occurring word : PMI
  see : 12.173
  be pronounced : 11.589
  be examined : 11.277
  want to consult : 11.024
  turn white : 10.506
  be stopped by : 10.281
  …

Similar word : Simpson coefficient
  ENT doctor : 0.754
  veterinarian : 0.742
  midwife : 0.664
  teacher : 0.613
  eye doctor : 0.573
  DOCTOR : 0.565
  …

Page 51: Q - LTRC - IIIT Hyderabad

(The same dictionary extraction, now annotated with distributional similarity scores: 0.419, 0.119 and 0.338 for the dog → animal, dog = spy and tap water = strait examples; the scores are used to filter idiosyncratic pairs.)

Page 52: Q - LTRC - IIIT Hyderabad

Synonym Extraction from a Web corpus

• Extract from symmetric parenthesis expressions: "..A(B)..", "..B(A).." → A = B
• Can extract synonyms among named entities, terminologies and neologisms, which cannot be extracted from a dictionary
  – 国際連合教育科学文化機関 = ユネスコ (UNESCO)
  – 放射性同位元素 = RI (radioisotope)
  – 携帯電話 = ケータイ (cellular phone)
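The symmetric-parenthesis extraction can be sketched as follows. This is a simplified illustration: the regex only handles ASCII parentheses and whitespace-delimited tokens, while real web pages also use full-width parentheses and need tokenization:

```python
import re
from collections import Counter

# "..A(B).." with ASCII parentheses; A is the token right before "(".
PAT = re.compile(r'([^\s()]+)\(([^()]+)\)')

def extract_synonyms(sentences, min_count=1):
    """Collect A(B) patterns; accept A = B only if B(A) also occurs (symmetry)."""
    pairs = Counter()
    for s in sentences:
        for a, b in PAT.findall(s):
            pairs[(a, b)] += 1
    synonyms = set()
    for (a, b), c in pairs.items():
        if c >= min_count and (b, a) in pairs:
            synonyms.add(frozenset((a, b)))
    return synonyms
```

The symmetry requirement is what filters out ordinary parenthetical comments that are not synonym pairs.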

Page 53: Q - LTRC - IIIT Hyderabad

Accuracy, judged on 100 randomly selected terms:

Source : Relation : # : Acc.
Dic : Synonym : 6,867 : 99%
Dic : Hypernym : 17,207 : 98%
Dic & Web (Similarity) : Synonym : 5,225 : 98%
Dic & Web (Similarity) : Hypernym : 9,274 : 96%
Web (parenthetic expression) : Synonym : 23,292 : 94%

Page 54: Q - LTRC - IIIT Hyderabad

NLP based on Information Explosion on the Web

NLP for Information Explosion on the Web

• Compilation of a basic lexicon and robust morphological analysis

• Case frame acquisition and predicate-argument structure analysis

• Synonymous expression acquisition and flexible matching

• Open search engine infrastructure
• Information organization system
• Information credibility analysis system

Page 55: Q - LTRC - IIIT Hyderabad

Two Governmental Projects

• 情報爆発 (Cyber Infrastructure for the Information-explosion Era)
  – 文部科学省 科学研究費補助金 (Ministry of Education, Culture, Sports, Science and Technology, Grants-in-Aid for Scientific Research)
• 情報分析 (Information Analysis Project)
  – 総務省 (Ministry of Internal Affairs and Communications) / NICT

Page 56: Q - LTRC - IIIT Hyderabad

Open search engine infrastructure

Page 60: Q - LTRC - IIIT Hyderabad

Yahoo/Google

Next-Generation Search

Page 61: Q - LTRC - IIIT Hyderabad

Search Engine Infrastructure TSUBAKI

Grid computing environment and huge storage servers

Next-Generation Search

Page 62: Q - LTRC - IIIT Hyderabad

Search Engine Infrastructure TSUBAKI


• Reproducible search results
  – Fixed set of 100 million Japanese web pages (May - July 2007)
• Web standard format for advanced NLP
  – Available via the TSUBAKI API
• Deep NLP indexing
  – Spelling variations, dependency relations and synonymous expressions
• Open search algorithm
• API without any restriction

Page 63: Q - LTRC - IIIT Hyderabad

Web Standard Format

Page 64: Q - LTRC - IIIT Hyderabad

Web standard format

• Problems in using web pages for NLP
  – Unclear sentence boundaries
  – Various meta-data in various tag formats
  – Spam
• A simple XML-styled data format for annotating meta-data and text-data of a web page
  – Meta-data: URL, crawl date, character encoding, title and anchor texts (in-links/out-links)
  – Text-data: sentences in a web page, and their analysis results by NLP tools

Page 66: Q - LTRC - IIIT Hyderabad

しかしさすがに電池の持ちが悪くなってきたのと、<br>たまたまキャンペーンをやっていて無料で機種変できるみたいだったから<br>

愛着のわいた携帯を手放すことにした。</p>

<p>で、折角変えるならまた長く使えるのがいいじゃない?<br>すごいいいデザインのがあって(しかもロゴがsoftbank!!)<br>これに決めた!!</p>

<p>と思ったら…</p>

<p>なんと品切れ。他の店舗に問い合わせてもどこも品切れ。<br>唯一あったのが電車で40分かかる所。</p>

<p>もちろん行きました。</p>

<p>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますw<br>また5年間使い続けるぞい!</p>

</div></div><p class="entry-footer">

<span class="post-footers">投稿者: KN006 日時: 2006年10月16日 22:05

</span>

(Annotated problems: sentence boundary, layout adjustment, meta data)

Page 67: Q - LTRC - IIIT Hyderabad

<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/" OriginalEncoding="utf8" Time="2007-06-18 18:13:38">
<Text>
… (omitted) …
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="10">
<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="11">
<RawString>で、折角変えるならまた長く使えるのがいいじゃない?</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="12">
<RawString>すごいいいデザインのがあって(しかもロゴがsoftbank!!)これに決めた!!と思ったら…</RawString>
</S>
… (omitted) …
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="18">
<RawString>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますwまた5年間使い続けるぞい!</RawString>
</S>
</Text>
</StandardFormat>

Page 68: Q - LTRC - IIIT Hyderabad

<?xml version="1.0" encoding="utf-8"?><StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/"OriginalEncoding="utf8" Time="2007-06-18 18:13:38"><Text>…中略…<S Offset="3490" Length="1070" is_Japanese_Sentence=“1" Id="10">

<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>

<Annotation Scheme="Knp"><![CDATA[* 14D <BGH:しかし/しかし><文頭><接続詞><係:連用>しかし しかし しかし 接続詞 10 * 0 * 0 * 0 "代表表記:しかし/しかし"<自立><文節始>* 4D <BGH:流石/さすが><助詞><体言><修飾><係:ニ格><格要素><連用要素>さすが さすが さすが 副詞 8 * 0 * 0 * 0 "代表表記:流石/さすが" <自立><文節始>に に に 助詞 9 格助詞 1 * 0 * 0 NIL <付属>* 3D <BGH:電池/でんち><助詞><連体修飾><体言><係:ノ格>電池 でんち 電池 名詞 6 普通名詞 1 * 0 * 0 "ドメイン:家庭・暮らし カテゴリ:人工物-その他 代表表記:電池/でんち“ <名詞相当語><自立><文節始>… 中略 …した した する 動詞 2 * 0 サ変動詞 16 タ形 10 NIL <連体修飾><活用語><付属>。 。 。 特殊 1 句点 1 * 0 * 0 NIL <文末><英記号><記号><付属>EOS]]></Annotation></S>

(Annotations in the KNP output: part-of-speech, domain, category, representative form)

Page 69: Q - LTRC - IIIT Hyderabad

Deep NLP Indexing

Page 70: Q - LTRC - IIIT Hyderabad

Inverted Index

Documents:
  Page1: language, computer, problem, of
  Page2: computer, problem, of
  Page3: language, problem, information, of, and

Inverted index:
  language → Page1, Page3
  computer → Page1, Page2
  problem → Page1, Page2, Page3
  information → Page3
  of → Page1, Page2, Page3
  and → Page3
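The construction above can be sketched as a one-pass build over the documents:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

docs = {
    "Page1": ["language", "computer", "problem", "of"],
    "Page2": ["computer", "problem", "of"],
    "Page3": ["language", "problem", "information", "of", "and"],
}
index = build_inverted_index(docs)
```

A query is then answered by intersecting the posting lists of its terms instead of scanning documents.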

Page 71: Q - LTRC - IIIT Hyderabad

Items in index data

Index type : Doc. IDs : Freq. in a doc. : String : DF : Sent. IDs : Position
Word : O : O : O : O : O : O
Dep. of words : O : O : O : O : X : X
Synonymous expressions : O : O : O : X : X : O
Dep. of synonymous expressions : O : O : O : X : X : O

DF: document frequency

Page 72: Q - LTRC - IIIT Hyderabad

Sentences in a document

Example document (two sentences, word positions 1-9):

S1: こども 服 を せんたくしました 。  (washed/selected the children's clothes; せんたく is ambiguous between 洗濯 "wash" and 選択 "select")
S2: こども と 一緒に 。  (together with the child)

Positions: 1 こども (child), 2 服 (clothes), 3 を (wo), 4 せんたくしました (washed/selected), 5 。, 6 こども (child), 7 と (to), 8 一緒に (together), 9 。

Page 73: Q - LTRC - IIIT Hyderabad

Word index

(SID: sentence IDs; the two readings of the ambiguous せんたく share the frequency)

Word : SID : Freq. : Position
CHILD : 1, 2 : 2.0 : 1, 6
CLOTH : 1 : 1.0 : 2
WO : 1 : 1.0 : 3
WASH : 1 : 0.5 : 4
SELECT : 1 : 0.5 : 4
. : 1, 2 : 2.0 : 5, 9
TO : 2 : 1.0 : 7
TOGETHER : 2 : 1.0 : 8

Page 74: Q - LTRC - IIIT Hyderabad

Dependency relation index

Dependency relation : Freq.
CHILD→CLOTH : 1.0
CLOTH→WASH : 0.5
CLOTH→SELECT : 0.5
CHILD→TOGETHER : 1.0
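The fractional frequencies for the ambiguous せんたく can be produced by splitting each dependency count evenly over the alternative readings of its words. A sketch:

```python
from collections import defaultdict

def build_dep_index(dependencies):
    """Each dependency is (modifier_readings, head_readings); ambiguous
    readings share the count equally, giving fractional frequencies."""
    index = defaultdict(float)
    for mods, heads in dependencies:
        weight = 1.0 / (len(mods) * len(heads))
        for m in mods:
            for h in heads:
                index[f"{m}->{h}"] += weight
    return dict(index)

deps = [
    (["CHILD"], ["CLOTH"]),           # こども -> 服
    (["CLOTH"], ["WASH", "SELECT"]),  # 服 -> せんたく (ambiguous reading)
    (["CHILD"], ["TOGETHER"]),        # こども -> 一緒に
]
```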

Page 75: Q - LTRC - IIIT Hyderabad

Synonymous expression index

Each word is replaced by the ID of its synonymous expression set:
  S11412:こども {A LITTLE CHILD, CHILD}
  S55:服 {CLOTHES, CLOTHING, GARMENTS, WEAR, DRESS}
  S17250:洗濯 {WASH, CLEAN}
  S10184:選択 {SELECT, CHOOSE}
  S15355:一緒 {TOGETHER}

Synonymous expression set : Freq. : Position
S11412:こども : 2.0 : 1, 6
S55:服 : 1.0 : 2
S10184:選択 : 0.5 : 4
S17250:洗濯 : 0.5 : 4
S15355:一緒 : 1.0 : 8

Page 76: Q - LTRC - IIIT Hyderabad

Dependency index of synonymous expressions

Dependency relation between synonymous expression sets : Freq. : Position
S11412:こども → S55:服 : 1.0 : 1
S55:服 → S17250:洗濯 : 0.5 : 2
S55:服 → S10184:選択 : 0.5 : 2
S11412:こども → S15355:一緒 : 1.0 : 6

Page 77: Q - LTRC - IIIT Hyderabad

Query syntax

• Natural language sentence: 京都大学への行き方 (how to get to Kyoto University: 京都大学 Kyoto Univ., 行き方 access)
• Phrase search: "京都大学"
• Proximity search (word): 京都大学~5W : 京都 and 大学 co-occur within 5 words, in that order
• Proximity search (sentence): 京都大学~5S : 京都 and 大学 co-occur within 5 sentences, in that order
• Combination of the above notations: 京都大学への行き方 "市バス" (市バス: city bus)
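The positional check behind the ~nW operator can be sketched with the position lists stored in the word index (assumed semantics: some occurrence of the second term follows an occurrence of the first within n word positions):

```python
def within_n_words(positions_a, positions_b, n):
    """True if an occurrence of term B follows an occurrence of term A
    within n word positions, in that order."""
    return any(0 < pb - pa <= n for pa in positions_a for pb in positions_b)
```

For example, 京都 at position 10 and 大学 at position 12 satisfy 京都大学~5W, while the reversed order does not.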

Page 78: Q - LTRC - IIIT Hyderabad

[Animated diagram: for a query with terms w1, w2, w3, candidate documents d1 and d2 are looked up in the inverted index term by term, and the stored positions are checked so that the terms co-occur within N words]

Page 87: Q - LTRC - IIIT Hyderabad

Scoring method

• Score calculated from a query Q for a document d:

score(Q,d) = rel_w(Q_w, d) + rel_d(Q_d, d)

  rel_w: score calculated from the words in Q
  rel_d: score calculated from the dependency relations in Q

– Q = 子供(child) の(no) 体力(strength) 低下(decrement)
  ⇒ Q_w = {子供, 体力, 低下}, Q_d = {子供→体力, 体力→低下}

Page 88: Q - LTRC - IIIT Hyderabad

Scoring method

• Score calculated from the word indices Q_w in Q for a document d
  – Q_w = {子供(child), 体力(strength), 低下(decrement)}

rel_w(Q_w, d) = Σ_{q ∈ Q_w} [ fq·(k1+1) / (fq + K) ] × [ qfq·(k3+1) / (qfq + k3) ] × log( (N − n + 0.5) / (n + 0.5) )   (OKAPI BM25)

K = k1 × ( (1 − b) + b·l/l_ave )

  fq: the frequency of the expression q in d
  qfq: the frequency of q in Q
  n: the document frequency of q in the 100 million pages
  N: 1 × 10^8
  l: the document length of d
  l_ave: the average document length over all the pages
  k1, k3, b: constants
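The word score can be sketched as follows. The constants k1 = 2, b = 0.75, k3 = 1000 are assumptions; the slide fixes constants, but their exact values were not recoverable from the extraction:

```python
import math

def bm25_weight(fq, qfq, n, N, l, l_ave, k1=2.0, b=0.75, k3=1000.0):
    """One summand of rel_w(Q_w, d); k1, b, k3 values are assumptions."""
    K = k1 * ((1 - b) + b * l / l_ave)
    tf = fq * (k1 + 1) / (fq + K)                 # term-frequency component
    qtf = qfq * (k3 + 1) / (qfq + k3)             # query-term-frequency component
    idf = math.log((N - n + 0.5) / (n + 0.5))     # document-frequency component
    return tf * qtf * idf

def rel_w(query_terms, doc_terms, df, N=1e8, l_ave=500.0):
    """Sum BM25 weights over the distinct query words found in the document."""
    l = len(doc_terms)
    return sum(
        bm25_weight(doc_terms.count(q), query_terms.count(q), df[q], N, l, l_ave)
        for q in set(query_terms) if doc_terms.count(q) > 0
    )
```

As usual with BM25, rare terms (small document frequency n) dominate the score.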

Page 89: Q - LTRC - IIIT Hyderabad

Scoring method

• Score calculated from the dependency relation indices Q_d in Q for a document d
  – Q_d = {子供(child)→体力(strength), 体力(strength)→低下(decrement)}

rel_d(Q_d, d) = Σ_{q ∈ Q_d} h(q,d)

h(q,d) = f(q,d) if d includes q; g(q,d) otherwise

Page 90: Q - LTRC - IIIT Hyderabad

Scoring method (d includes q)

• When the dependency relation q occurs in d, OKAPI BM25 is applied to the relation:

f(q,d) = [ fq·(k1+1) / (fq + K) ] × [ qfq·(k3+1) / (qfq + k3) ] × log( (N − n + 0.5) / (n + 0.5) )

K = k1 × ( (1 − b) + b·l/l_ave )

  fq: the frequency of the expression q in d
  qfq: the frequency of q in Q
  n: the document frequency of q in the 100 million pages
  N: 1 × 10^8
  l: the document length of d
  l_ave: the average document length over all the pages

Page 91: Q - LTRC - IIIT Hyderabad

Scoring method (d does not include q)

• When the two words of q co-occur in d but are not directly linked, a pseudo-frequency w(q) based on their distance replaces fq:

g(q,d) = [ w(q)·(k1+1) / (w(q) + K) ] × [ qfq·(k3+1) / (qfq + k3) ] × log( (N − n + 0.5) / (n + 0.5) )

w(q) = (D − min(l(q), r(q))) / D   if min(l(q), r(q)) < D
     = 0                           otherwise

  l(q): parent of the dependency relation q
  r(q): child of the dependency relation q
  min(q1, q2): minimum distance between q1 and q2 (# of words)
  D: threshold of distance (D = 30)
  n: DF value of the dependency relation q1→q2
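The distance-based pseudo-frequency is a simple linear decay; a minimal sketch:

```python
def w_q(dist, D=30):
    """Pseudo-frequency for a dependency relation whose two words co-occur in
    the document but are not directly linked: decays linearly with their
    minimum distance in words, vanishing at the threshold D."""
    return (D - dist) / D if dist < D else 0.0
```

Adjacent words get weight 1.0, and words D or more positions apart contribute nothing.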

Page 92: Q - LTRC - IIIT Hyderabad

Contribution of deep NLP indices

(NTCIR 10M web test set)

Method : P10 : R-precision
Baseline : 0.232 : 0.155
Word (representative form) : 0.257 : 0.162
Dependency relation (f(q,d) + g(q,d)) : 0.230 : 0.168
Dependency relation (only f(q,d)) : 0.253 : 0.170

Page 93: Q - LTRC - IIIT Hyderabad

A user's query goes through a load balance server to one of 4 master servers (parse the query; merge retrieved pages; create a search result). The master server consults 27 search servers (retrieve pages; rank retrieved pages), each holding index data generated from a million web pages, and 16 snippet creation servers, each holding a million pages of web standard format data.

Page 94: Q - LTRC - IIIT Hyderabad

Data sizes per 100 million pages:
  Title DB: 9.3 GB
  URL DB: 7.9 GB
  DF DB: 115 GB
  total: 132 GB

Index data sizes per million pages:
  Word: 11 GB
  Dep.: 8.9 GB
  Synonymous expressions (word and phrase): 18 GB
  Dep. of synonymous expressions: 48 GB
  total: 85.9 GB

Gzipped web standard format data per million pages: 31 GB

Page 95: Q - LTRC - IIIT Hyderabad

Required time per query
(Pipeline: the load balance server sends a query; the master server parses the query and later merges and ranks; the search servers retrieve pages and calculate scores; the snippet creation servers create snippets.)

  Get hit count by API: 7.9 seconds
  Get document IDs, titles and URLs of the top 100 pages by API: 9.7 seconds
  Ordinary search (50 pages are shown): 32.6 seconds
  Get document IDs, titles and URLs of the top 1000 pages by API: 12.7 seconds

* Document IDs are necessary for obtaining cached web pages and web standard format data.

Page 96: Q - LTRC - IIIT Hyderabad

TSUBAKI API

Page 97: Q - LTRC - IIIT Hyderabad

TSUBAKI API: http://tsubaki.ixnlp.nii.ac.jp/api.cgi

• No user registration
• No limit on the number of API calls per day
• Provides all pages of a search result
  – cf. Yahoo! API: top 1000 pages of a search result; Google AJAX Search API: top 8 pages; (previous) Google API: top 1000 pages
• Provides web standard format data

Page 98: Q - LTRC - IIIT Hyderabad

Request parameters

Parameter : Value : Description
query : string : The query to search for (UTF-8 encoded). Required for obtaining search results.
start : integer : The starting result position to return.
results : integer : The number of results to return.
logical_operator : AND/OR : The logical operation to search with.
only_hitcount : 0/1 : Set to 1 to obtain only the query's hit count.
id : integer : The document ID for obtaining a cached web page or the standard format data corresponding to the ID.
format : html/xml : The document type to return.

Example: http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?query=%E4%BA%AC%E9%83%BD%E8%A6%B3%E5%85%89&start=1&results=20
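Building such a request is a matter of URL-encoding the parameters; a minimal sketch that reproduces the example URL (the endpoint and parameter names are the ones documented above, and whether the service is still reachable is not assumed here):

```python
from urllib.parse import urlencode

API = "http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi"

def build_request(query, start=1, results=20, **extra):
    """Build a TSUBAKI API request URL; parameters follow the table above."""
    params = {"query": query, "start": start, "results": results}
    params.update(extra)   # e.g. only_hitcount=1, logical_operator="AND"
    return API + "?" + urlencode(params)

url = build_request("京都観光")
```

`urlencode` percent-encodes the UTF-8 bytes of the query, which is exactly the encoding the API expects.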

Page 99: Q - LTRC - IIIT Hyderabad

(The query parameter above is the URI-encoded string of 京都観光 (Kyoto sightseeing).)

Page 100: Q - LTRC - IIIT Hyderabad


Page 101: Q - LTRC - IIIT Hyderabad

http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=html&id=06832381

Page 102: Q - LTRC - IIIT Hyderabad

http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=xml&id=06832381

Page 103: Q - LTRC - IIIT Hyderabad

Conclusion

• Search engine infrastructure TSUBAKI
  – reproducible search results,
  – Web standard format for sharing pre-processed web pages,
  – indices generated by deep NLP,
  – open search algorithm, and
  – APIs without any restriction
• Available from http://tsubaki.ixnlp.nii.ac.jp/index.cgi

Page 104: Q - LTRC - IIIT Hyderabad

Information organization system

Page 105: Q - LTRC - IIIT Hyderabad

Search Engine Infrastructure TSUBAKI

Grid computing environment and huge storage servers

Next-Generation Search

Page 107: Q - LTRC - IIIT Hyderabad

Search result clustering

• Advantages
  – Provides a bird's-eye view of a search result
  – Provides efficient access to the necessary pages
  – Surfaces low-ranked pages in a search result
• Requirements
  – Quick cluster construction
  – High-quality cluster labels
    • These affect access to the necessary pages

Page 110: Q - LTRC - IIIT Hyderabad

Characteristics of our system

• Cooperation with the search engine infrastructure TSUBAKI
  – Full text data of web pages and their analysis data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions

Page 111

Distillation of labels

[Slide figure: label distillation example]
• Assimilate expressive divergence: 得点力アップ / 得点力UP (improving scoring ability) → 得点力アップ; 新教育課程 / 新カリキュラム (new curriculum) → 新教育課程
• Eliminate inappropriate compound nouns: サイトマップ (site map) is discarded
• Merge substrings: 教育基本法改正 / 教育基本法改正案 / 教育基本法の改正案 (amendment (bill) of the Basic Act on Education) → 教育基本法改正案
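The three distillation steps can be sketched as a toy pipeline. The synonym table and stop-label list here are illustrative stand-ins for the resources the real system extracts from dictionaries and the web corpus:

```python
# Toy label distillation: (1) assimilate expressive divergence,
# (2) eliminate inappropriate labels, (3) merge substring labels.
SYNONYMS = {"得点力UP": "得点力アップ", "新カリキュラム": "新教育課程"}  # hypothetical table
STOPLABELS = {"サイトマップ"}  # hypothetical list of labels useless for clustering

def distill(labels):
    # Step 1: map each candidate onto its canonical variant.
    canon = [l for l in (SYNONYMS.get(l, l) for l in labels)]
    # Step 2: discard inappropriate labels.
    canon = [l for l in canon if l not in STOPLABELS]
    # Step 3: merge a label into any longer label that contains it.
    merged = []
    for l in sorted(set(canon), key=len, reverse=True):
        if not any(l in m for m in merged):
            merged.append(l)
    return merged

labels = ["サイトマップ", "得点力アップ", "得点力UP", "新教育課程",
          "新カリキュラム", "教育基本法改正", "教育基本法改正案"]
print(distill(labels))
```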

Page 112

Overview of our clustering system

Step 1. Label acquisition (e.g. 国際捕鯨委員会 (IWC), 調査捕鯨 (scientific whaling), …)
Step 2. Cluster generation
Step 3. Cluster organization
Step 4. Display
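Steps 1 and 2 can be illustrated with a toy cluster generator: a cluster is simply a label together with the pages in which it occurs. The page texts and labels below are invented for illustration:

```python
from collections import defaultdict

# Toy cluster generation: map each label to the set of page IDs whose
# text contains it. Pages and labels are illustrative examples.
pages = {
    1: "国際捕鯨委員会の年次総会が開かれた",   # "the IWC annual meeting was held"
    2: "日本の調査捕鯨をめぐる議論",           # "debate over Japan's scientific whaling"
    3: "国際捕鯨委員会は調査捕鯨を審議した",   # "the IWC deliberated on scientific whaling"
}
labels = ["国際捕鯨委員会", "調査捕鯨"]

def make_clusters(pages, labels):
    clusters = defaultdict(set)
    for pid, text in pages.items():
        for label in labels:
            if label in text:
                clusters[label].add(pid)
    return dict(clusters)

print(make_clusters(pages, labels))
```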

Page 113

Architecture

Query → Search & Page ID Gathering (Search Engine TSUBAKI) → Web Standard Format Collection → Compound Noun Extraction → Label Selection & Clustering → Clusters

Page 114

Clustering result – “Whaling problem” –

• IWC (357 pages)
  – 科学委員会 (Scientific committee)
  – 年次総会 (Annual meeting)
  – IWC総会 (IWC meeting)
  – 原住民生存捕鯨 (Aboriginal subsistence whaling scheme)
  – 鯨種 (Species of whales)
  – …
  △ Explanations of the IWC, criticism of the IWC, and others
• 調査捕鯨 (Scientific whaling) (145 pages)
  – 日本の調査捕鯨 (Scientific whaling in Japan)
  ○ Explanations and positive or negative opinions of scientific whaling
• 捕鯨船 (Whaling ships) (65 pages)
  ○ Accidents and history of whaling ships

Page 115

[Slide figure: the clustering result of the query “whaling problem”, with each label annotated with the rank of a web page including it (1st, 6th, 19th, 32nd, 37th, 41st, 44th, 72nd, 94th, …)]
● IWC (357 pages): 科学委員会 (Scientific committee), 年次総会 (Annual meeting), IWC総会 (IWC meeting), 原住民生存捕鯨 (Aboriginal Subsistence Whaling Scheme), 鯨種 (Species of whales), …
● 調査捕鯨 (Scientific whaling) (145 pages): 日本の調査捕鯨 (Scientific whaling in Japan)
● 捕鯨船 (Whaling ships) (65 pages)
● 南極海 (Antarctic Ocean) (51 pages)

Users can find web pages that are low-ranked in a search result.
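The point of the figure can be made concrete: pages ranked far beyond the first results page stay reachable through their cluster labels. A minimal sketch with illustrative rank data loosely following the figure:

```python
# Toy illustration: clusters make low-ranked pages reachable. Each cluster
# maps a label to the search ranks of its member pages (illustrative data).
clusters = {
    "IWC": [1, 32, 37, 41, 44, 72],
    "調査捕鯨": [1, 94],   # scientific whaling
    "捕鯨船": [6],         # whaling ships
    "南極海": [19],        # Antarctic Ocean
}

def reachable_low_ranked(clusters, top_n=20):
    """Ranks beyond the first top_n results that are still reachable
    through some cluster label."""
    return sorted({r for ranks in clusters.values() for r in ranks if r > top_n})

print(reachable_low_ranked(clusters))  # [32, 37, 41, 44, 72, 94]
```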

Page 116

Conclusion

• Label-based search result clustering system
• Cooperation with the search engine infrastructure TSUBAKI
  – Full text of each web page and its analyzed data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions

Page 117

Information credibility analysis system

Page 118

Information Credibility Analysis

1. Credibility of information contents
2. Credibility of information sender
3. Credibility estimated from document style and superficial characteristics
4. Credibility based on social evaluation of information contents/sender

Page 119

1. Credibility of information contents

• Sentences in the related documents are classified into opinions, events, and facts; opinion sentences are further classified into positive and negative opinions.
• Documents in each cluster are summarized using multi-document summarization techniques and their extensions.
• Relations such as similarity, opposition, causal relations, and supporting relations are detected among inner- and inter-cluster statements, which leads to the detection of logical consistency and contradiction.
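A drastically simplified version of the first step, sentence classification, can be sketched as rules over cue words. The lexicons are hypothetical stand-ins; the real system relies on deep NLP rather than keyword spotting:

```python
# Toy sentence classifier: opinions vs. facts/events, and opinions into
# positive/negative. Cue lexicons are hypothetical stand-ins.
POSITIVE = {"good", "beneficial", "should be promoted"}
NEGATIVE = {"bad", "harmful", "should be banned"}
OPINION_CUES = {"i think", "in my opinion", "should"}

def classify(sentence):
    s = sentence.lower()
    if any(cue in s for cue in OPINION_CUES):
        if any(w in s for w in NEGATIVE):
            return "negative opinion"
        if any(w in s for w in POSITIVE):
            return "positive opinion"
        return "opinion"
    return "fact/event"

print(classify("I think whaling should be banned."))  # negative opinion
print(classify("The IWC meeting was held in 2007."))  # fact/event
```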

Page 120

Deep NLP ⇒ Information Credibility

[Slide figure: predicate-argument structures of ミンク鯨の数が減っている “the number of minke whales is decreasing”, ミンククジラの数は増えている “the number of minke whales is increasing”, and 問題はミンク鯨だ.絶滅しかかっている “the problem is the minke whale; it is facing extinction”]
• Word segmentation and identification (ミンククジラ / ミンク鯨 “minke whale”)
• Predicate-argument structure analysis
• Anaphora resolution
• Flexible matching
⇒ A conflict is detected between the “increasing” and “decreasing” statements
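The conflict detection illustrated on this slide can be sketched as follows; the variant table and antonym pairs are illustrative stand-ins for the synonymous-expression and case-frame resources described earlier:

```python
# Toy conflict detection after flexible matching: spelling variants are
# normalized (ミンククジラ -> ミンク鯨) and antonymous predicates about the
# same argument are flagged. Variant table and antonym pairs are illustrative.
VARIANTS = {"ミンククジラ": "ミンク鯨"}
ANTONYMS = {("増えている", "減っている"), ("減っている", "増えている")}

def normalize(arg):
    return VARIANTS.get(arg, arg)

def conflicts(statements):
    # statements: list of (argument, predicate) pairs
    found = []
    norm = [(normalize(a), p) for a, p in statements]
    for i, (a1, p1) in enumerate(norm):
        for a2, p2 in norm[i + 1:]:
            if a1 == a2 and (p1, p2) in ANTONYMS:
                found.append((a1, p1, p2))
    return found

stmts = [("ミンククジラ", "増えている"),  # "minke whales are increasing"
         ("ミンク鯨", "減っている")]      # "minke whales are decreasing"
print(conflicts(stmts))  # [('ミンク鯨', '増えている', '減っている')]
```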

Page 121

2. Credibility of information sender

• Information sender:
  – individuals
    • expert or not
    • individuals identified by handle name, and others
  – organizations
    • public organizations (administrative organs, academic associations, universities),
    • media,
    • commercial companies, and others
• Distinguished by:
  – meta-information such as URLs, page titles, anchor texts, and RSS
  – named entity (NE) extraction
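Distinguishing sender types from URL meta-information can be sketched with domain-suffix heuristics; the suffix table below is an illustrative assumption (Japanese second-level domains), not the system's actual rules:

```python
from urllib.parse import urlparse

# Toy sender-type detection from a page URL. The domain-suffix
# heuristics are illustrative assumptions.
SUFFIXES = {
    ".go.jp": "public organization (government)",
    ".ac.jp": "public organization (university/academia)",
    ".co.jp": "commercial company",
}

def sender_type(url):
    host = urlparse(url).hostname or ""
    for suffix, kind in SUFFIXES.items():
        if host.endswith(suffix):
            return kind
    return "other/individual"

print(sender_type("http://www.mext.go.jp/whitepaper"))   # government
print(sender_type("http://www.kyoto-u.ac.jp/research"))  # academia
print(sender_type("http://blog.example.com/diary"))      # other/individual
```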

Page 122

2. Credibility of information sender

• Check the quantity and quality of the information the sender has produced so far.
• Information quality can be evaluated based on the other three criteria.
• The speciality of an individual or organization is also important; it can be detected by topic detection.

Page 123

3. Credibility estimated from document style and superficial characteristics

• Estimated by integrating many criteria, such as sentential style (formal or informal, written or spoken language), page layout, and the appropriateness of links in the page.
• cf. Persuasive Technology at Stanford University, and Google News' automatic assembling criteria.
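One of these criteria, sentential style, can be approximated by counting formal versus informal sentence markers; the marker lists and the simple subtraction are illustrative assumptions:

```python
# Toy style score: formal Japanese endings raise the score, informal
# colloquialisms lower it. Marker lists are illustrative assumptions.
FORMAL = ("です", "ます")
INFORMAL = ("だよ", "じゃん", "www")

def style_score(text):
    formal = sum(text.count(m) for m in FORMAL)
    informal = sum(text.count(m) for m in INFORMAL)
    return formal - informal

print(style_score("調査の結果を報告します。詳細は以下です。"))  # 2
print(style_score("これマジやばいじゃん www"))                  # -2
```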

Page 124

4. Credibility based on social evaluation of information contents/sender

• How they are evaluated by others.
• One way is to perform opinion mining from the web based on NLP, collecting and counting positive and negative evaluations of the information content/sender.
• Another way is to directly use the rankings and comments of others, as in social network frameworks.
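The opinion-mining route can be sketched as counting positive and negative evaluation mentions collected for a content/sender; the cue lexicons and mentions are illustrative:

```python
from collections import Counter

# Toy opinion aggregation: count positive/negative evaluation words in
# mentions of a sender collected from the web. Lexicons are illustrative.
POS = {"reliable", "accurate", "trustworthy"}
NEG = {"misleading", "wrong", "biased"}

def evaluate(mentions):
    c = Counter()
    for m in mentions:
        words = set(m.lower().split())
        c["positive"] += len(words & POS)
        c["negative"] += len(words & NEG)
    return c

mentions = ["this site is reliable and accurate",
            "their report was misleading"]
print(evaluate(mentions))
```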

Page 125

Information Credibility Analysis System WISDOM (2006~)

Facets: Sender / Clustering / Ontology / Opinion / Q & A

Page 126

Information Credibility Analysis System WISDOM (2006~)

[Slide figure: WISDOM screenshot for the query “Agaricus”, showing page clustering, sender analysis, and opinion distribution]

Page 127

Summary

• Various kinds of linguistic and extra-linguistic knowledge can be acquired from the web corpus using a high-performance computing environment.
• Deep NLP, in particular accurate predicate-argument structure analysis and flexible matching, provides key technologies for next-generation search.
• Automatic information credibility evaluation is not easy, but an information organization system and a multi-faceted information analysis system greatly help users' own evaluation.

Page 128

References

• D. Kawahara and S. Kurohashi. Case frame compilation from the web using high-performance computing. In Proceedings of LREC 2006, 2006.
• D. Kawahara and S. Kurohashi. A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. In Proceedings of HLT-NAACL 2006, pages 176–183, 2006.
• H. Miyamori, S. Akamine, Y. Kato, K. Kaneiwa, K. Sumi, K. Inui, and S. Kurohashi. Evaluation data and prototype system WISDOM for information credibility analysis. In Proceedings of the First International Symposium on Universal Communication, 2007.
• T. Shibata, M. Odani, J. Harashima, T. Oonishi, and S. Kurohashi. SYNGRAPH: A flexible matching method based on synonymous expression extraction from an ordinary dictionary and a web corpus. In Proceedings of IJCNLP 2008, 2008.
• K. Shinzato, T. Shibata, D. Kawahara, C. Hashimoto, and S. Kurohashi. TSUBAKI: An open search engine infrastructure for developing new information access methodology. In Proceedings of IJCNLP 2008, 2008.

