
Semantic Knowledge Discovery, Organization and Use

Warren Weaver Hall, New York University

November 14 and 15, 2008

NSF Sponsored Symposium


Next...

P-1 The Development of a Shared Dataset for Predictive Analysis in the Behavioral Sciences

Kai R. Larsen, Jintae Lee, Eliot Rich
U. Colorado

On Deck: P-2 Catherine Havasi
Double Deck: P-3 Iryna Gurevych


P-1

The Development of a Shared Dataset for Predictive Analysis in the Behavioral Sciences

Kai Larsen, Jintae Lee, and Eliot Rich

[Figure: known vs. unknown relationships (in thousands); vertical axis from 0 to 1,400,000.]

Setting: A large portion of behavioral research focuses on very distinct knowledge constructs and their relationships.

Problem: For every behavioral paper published…

• known relationships increase linearly

• unknown relationships increase exponentially

[Figure: diagram of constructs X, Y, and Z; horizontal axis from 0 to 10,000.]

Solution:

• Collect large dataset of behavioral constructs and their relationships

• Make this available to the community of knowledge discovery researchers

• By automatically figuring out 1% of the relationships, a researcher could contribute more to science than 100,000 behavioral researchers could do in their lifetimes.


Next...

P-2 Discovering Semantic Relations Using Singular Value Decomposition Based Techniques

Catherine Havasi
Brandeis University

On Deck: P-3 Iryna Gurevych
Double Deck: P-4 Roy Bar-Haim


P-2

Acquiring and Using Common Sense

• Acquire Common Sense
– From human volunteers
– From inference
– From corpora

• Using Dimensionality Reduction
– To learn more common sense
– To add common sense intuition to domain specific data
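As a rough illustration of how dimensionality reduction can "learn more common sense", the sketch below computes a rank-1 truncated SVD of a tiny concept-by-feature matrix by power iteration and uses the reconstruction to score a cell that was left at zero. The matrix, the concept and feature names, and the choice of rank 1 are assumptions made for this example; they are not taken from the poster.

```python
import math

def top_singular_triplet(matrix, iterations=100):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical toy data: rows are concepts, columns are features.
concepts = ["dog", "cat", "car"]
features = ["is_pet", "has_fur", "has_wheels"]
A = [
    [1, 1, 0],  # dog: is_pet, has_fur
    [1, 0, 0],  # cat: is_pet (has_fur was never asserted)
    [0, 0, 1],  # car: has_wheels
]

sigma, u, v = top_singular_triplet(A)
# Rank-1 reconstruction: sigma * u_i * v_j smooths over missing assertions.
approx = [[sigma * u[i] * v[j] for j in range(len(features))] for i in range(len(concepts))]
cat_has_fur = approx[1][1]
car_has_fur = approx[2][1]
```

In this toy case the reconstruction gives "cat has_fur" a clearly positive score because cats and dogs share the is_pet feature, while "car has_fur" stays near zero.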


Next...

P-3 Putting the Wisdom-of-Crowds to Use in NLP: Collaboratively Constructed Semantic Resources on the Web

Iryna Gurevych
Technical University of Darmstadt, Germany

On Deck: P-4 Roy Bar-Haim
Double Deck: P-5 Derrick Higgins


P-3

Putting the “Wisdom-of-Crowds” to Use in NLP: Collaboratively Constructed Semantic Resources on the Web

Ubiquitous Knowledge Processing Lab, Iryna Gurevych

Information Extraction (Ruiz-Casado et al., 2005)
Information Retrieval (Gurevych et al., 2007)
Named Entity Recognition (Bunescu & Pasca, 2006)
Question Answering (Ahn et al., 2004)
Text Categorization (Gabrilovich & Markovitch, 2006)

Semantic Relatedness (Zesch et al., 2008)
Information Retrieval (Müller and Gurevych, 2008)

Wikipedia, Wiktionary, WordNet, GermaNet, ...
JWPL, JWKTL, JWNL, GN API, ...

Mapping

Information Extraction
Information Retrieval
Lexical Chains
Lexical Graphs
Named Entity Recognition
Question Answering
Semantic Relatedness
Text Categorization
Text Segmentation
Text Summarization
Word Sense Disambiguation
...

Unified access

Entity
• Part of Speech
• Lexeme / Sense pairs

Lexical Relations
• Synonymy
• Antonymy

Semantic Relation
• Hypernymy
• Hyponymy
• …

Explicit lexical-semantic relations

Advantages of collaborative construction

Abbreviations, Antonyms, Categories, Collocations, Derived Terms, Etymology, Examples, Glosses, Hypernyms, Hyponyms, Morphology, Part-of-speech, Pronunciation, Quotations, Related terms, Synonyms, Translations, Troponyms, Word senses

Big, Multi-lingual, Cheap, Up-to-date


Open issues with “the user‐contributed information”

• incompleteness of information
• inconsistent structure of entries
• uneven coverage
• vagueness of concepts
• insufficient quality of information

This work is funded by the German Research Foundation (DFG GU 798/1‐2, 798/1‐3, and 798/3‐1) and the Volkswagen‐Foundation (I/82806)

Wikipedia & Wiktionary API http://www.ukp.tu-darmstadt.de/software


Next...

P-4 Efficient Semantic Inference over Language Expressions

Roy Bar-Haim
Bar-Ilan University

On Deck: P-5 Derrick Higgins
Double Deck: P-6 Jung-Wei Fan


P-4


Next...

P-5 Length-independent vector-space document similarity measures

Derrick Higgins
Educational Testing Service

On Deck: P-6 Jung-Wei Fan
Double Deck: P-7 Peter Clark


P-5

Length-independent vector-space document similarity measures

Derrick Higgins, ETS

The Problem: Similarity and Length

The similarity between two texts, as estimated by vector-based methods (CVA, LSA, RI, ...) depends not only on their congruence of meaning, but also on the lengths of the texts compared.

Documents on similar topics will converge to similar representation vectors as their length increases.

Longer documents are more likely to appear similar than shorter ones.

Even documents on different topics may exhibit some increase in similarity scores with increasing length.

[Figure: two panels plotting CVA Similarity and RI Similarity (vertical axis, −0.2 to 1.0) against gTypes (horizontal axis, 500 to 2000).]

One simple way to remove the effect of text length is to subtract an estimate of similarity based on length, leaving the residual.
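The residual idea can be sketched as follows. The slide does not say how the length-based estimate is obtained, so this example assumes an ordinary least-squares line fitted to (length, similarity) pairs; the function names and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residual_similarities(lengths, sims):
    """Subtract the length-predicted similarity from each observed similarity."""
    a, b = fit_line(lengths, sims)
    return [s - (a + b * length) for length, s in zip(lengths, sims)]

# Hypothetical observations: text length (in types) and raw vector-space similarity.
lengths = [500, 1000, 1500, 2000]
sims = [0.30, 0.45, 0.55, 0.70]
residuals = residual_similarities(lengths, sims)
```

A positive residual then means a pair of texts is more similar than their length alone would predict under this assumed linear estimate.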


Next...

P-6 Semantic reclassification of ontology concepts using contextual and lexical features

Jung-Wei Fan, Carol Friedman
Columbia University

On Deck: P-7 Peter Clark
Double Deck: P-8 Alexander Yates


P-6

Semantic Reclassification of Ontological Concepts using Contextual and Lexical Features

Jung-Wei Fan, MS, MPhil; Carol Friedman, PhD

Department of Biomedical Informatics

Columbia University, New York

Example: The problem

Semantic Type: Finding

• Disease-related concepts: Progressive renal failure, Hyperkalemia, etc.
• Function-related concepts: Nitrogen balance, Mitotic activity, etc.
• Procedure-related concepts: Appendico-vesicostomy, etc.
• General finding concepts: Unemployment, Beer drinker, etc.

Methods

Training corpus

Naïve Bayes classifier

Distributional classifier

Bag of words for the classes

Training lexicon

Contexts for the classes

Hyperkalemia

Disorder

Training phase Classifying phase

Disorder
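The Naïve Bayes branch of the diagram can be illustrated with a minimal bag-of-words classifier. This is only a sketch: the training contexts, the class names other than "Disorder", and the add-one smoothing are assumptions for the example and do not come from the poster.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training contexts for two semantic classes.
training = [
    ("elevated serum potassium renal".split(), "Disorder"),
    ("chronic renal failure kidney".split(), "Disorder"),
    ("surgical creation of stoma".split(), "Procedure"),
    ("surgical repair of bladder".split(), "Procedure"),
]
model = train_naive_bayes(training)
# Context words that might surround a concept such as "Hyperkalemia".
predicted = classify(model, "serum potassium elevated".split())
```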


Next...

P-7 Semantic Knowledge Discovery, Organization and Use: Some Ongoing Research at Boeing

Peter Clark and Phil Harrison
Boeing Phantom Works

On Deck: P-8 Alexander Yates
Double Deck: P-9 Yutaka Matsuo


P-7

Knowledge Discovery, Organization and Use: Some Ongoing Research at Boeing

Peter Clark, Boeing Phantom Works

1. Developing WordNet (with Princeton and ISI)
– 30,000 additional links, glosses in logic, core theories

2. Extracting Commonsense Knowledge from Text
– database of 55 million Schubert-style "tuples"
– e.g., “planes can be bought”, “pilots can fly to places”, …

3. Recognizing Textual Entailment
– use of world knowledge, using WordNet and DIRT
– logical reasoning and explainable decisions

4. Machine Reading
– integration of semantic representations from multiple texts


Next...

P-8 ShopSmart: Product Recommendations through Technical Specifications and User Reviews

Alexander Yates
Temple University

On Deck: P-9 Yutaka Matsuo
Double Deck: P-10 Hiroyuki TODA


P-8

ShopSmart: Making Recommendations based on Technical Specifications and User Feedback

Alexander Yates (1), James Joseph (1), Ana-Maria Popescu (2)

(1) Computer and Information Sciences, Temple University, Philadelphia, PA, USA
(2) Yahoo! Labs, Santa Clara, CA, USA


Next...

P-9 Social Network Mining from the Web

Yutaka Matsuo, Danushka Bollegala, Hironori Tomobe, YingZi Jin, Junichiro Mori, Keigo Watanabe, Taiki Honma, Masahiro Hamasaki, Kotaro Nakayama, and Mizuki Oka
Tokyo University

On Deck: P-10 Hiroyuki TODA
Double Deck: P-11 Atsushi Fujita


P-9

Social Network Mining from the Web

Yutaka Matsuo and his colleagues, University of Tokyo, Japan

Our solution: POLYPHONET

Network View

At a conference: “Nice to meet you” and ... ?

Who is he?

Who are his colleagues?

What is he presenting?

What are his publications?

How is he connected with his colleagues?


Next...

P-10 Geographic Information Retrieval against Immediate Surroundings

Hiroyuki TODA, Norihito YASUDA, Yumiko MATSUURA, and Ryoji KATAOKA
NTT Cyber Solutions Laboratories

On Deck: P-11 Atsushi Fujita
Double Deck: P-12 Saif Mohammad


P-10

Geo-Information Retrieval Against Immediate Surroundings

• What is Geographic Information Retrieval (GIR):
– Doc retrieval method using content query (keyword) and geographic query.
– Utilize geographic expressions in each document.

• Problems and our propositions:
– Ranking:
• Estimate relevancy of each doc against the geo-query and prioritize the docs describing restricted areas related to geo-query.
=> Ranking method which considers extents implied by place names.
– Result representation:
• Represent the search results with consideration of geo-constraints and enable the users to decide easily whether to read docs or not, even if the screen size is restricted.
=> Query-biased summarization, which utilizes place name expressions related to the geo-query, for GIR result snippets.

Hiroyuki Toda, Norihito Yasuda, Yumiko Matsuura, Ryoji Kataoka (NTT Cyber Solutions Labs.)

• Goal of our GIR:
– Realize the searches for spots or services in our immediate surroundings via mobile communication devices.


Next...

P-11 Toward Automatic Compilation of Phrasal Thesaurus

Atsushi Fujita, Satoshi Sato
Nagoya University

On Deck: P-12 Saif Mohammad
Double Deck: P-13 Fabio Massimo Zanzotto


P-11

Toward Automatic Compilation of Phrasal Thesaurus

Atsushi Fujita and Satoshi Sato (Nagoya Univ., JAPAN)

Phrasal thesaurus
Beyond the word-based semantic computing
Deals with various phrasal paraphrases

Productive

Non-productive

X wrote Y ⇔ X is the author of Y
X solves Y ⇔ X deals with Y
X show a A Y ⇔ X v(Y) adv(A)
X V Y ⇔ X’s V-ing of Y
X V Y ⇔ Y be V-PP by X
burst into tears ⇔ cried
comfort ⇔ console

Generate!!

Collect!!



Next...

P-12 Towards Antonymy-Aware Natural Language Applications

Saif Mohammad and Bonnie Dorr, Graeme Hirst
University of Maryland, University of Toronto

On Deck: P-13 Fabio Massimo Zanzotto
Double Deck: P-14 Nitin Madnani


P-12

Towards Antonymy-Aware NL Applications

Saif Mohammad, Bonnie Dorr, Graeme Hirst

• Scope
– Clear opposites: wet-dry, promoted-demoted
– Contrasting word pairs: cold-warm, promoted-censured

• Method:
– Identify contrasting word pairs using seed antonym pairs and thesaurus categories.
– Determine degree of antonymy using distributional distance and tendency to co-occur.

• Evaluation:
– 950 GRE-style closest-opposite questions.

• Results:
– F score = .70 (baselines: .20 and .22).

• Applications:
– detecting incompatibles (contradictions, sentiment), generating paraphrases, detecting humor, improving distributional thesauri.


Next...

P-13 Combining Semi-Unsupervised Acquisition of Corpora and Supervised Learning of Textual Entailment Rules

Fabio Massimo Zanzotto
University of Rome “Tor Vergata”, Italy

On Deck: P-14 Nitin Madnani
Double Deck: P-15 Justin Betteridge


P-13

F.M.Zanzotto Saarbrucken 14/6/2007

University of Rome “Tor Vergata”

The Problem: To determine if:

“Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries”

entails

“Kessler's team interviewed more than 60,000 adults in 14 countries”

we need

• the equivalence between “X conducted Y interviews with Z” and “X interviewed Y Z”

• the implication rule that says “X” implies “more than Y” if “X is bigger than Y”
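The two rules needed for this example can be written out as a toy check. This is a hand-coded sketch using regular expressions, not the supervised rule learner described on the poster; the pattern names and the handling of modifiers such as "face-to-face" are assumptions.

```python
import re

# "X conducted Y interviews with Z" (optional modifiers before "interviews").
TEXT_RULE = re.compile(
    r"^(?P<x>.+?) conducted (?P<y>[\d,]+) (?:[\w-]+ )*?interviews with (?P<z>.+)$"
)
# "X interviewed more than Y Z"
HYP_RULE = re.compile(r"^(?P<x>.+?) interviewed more than (?P<y>[\d,]+) (?P<z>.+)$")

def to_int(number_text):
    """Parse a number written with thousands separators, e.g. '60,643'."""
    return int(number_text.replace(",", ""))

def entails(text, hypothesis):
    """True if the equivalence rule aligns X and Z and the text's count exceeds Y."""
    t = TEXT_RULE.match(text)
    h = HYP_RULE.match(hypothesis)
    if not t or not h:
        return False
    same_arguments = t["x"] == h["x"] and t["z"] == h["z"]
    return same_arguments and to_int(t["y"]) > to_int(h["y"])

text = "Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries"
hypothesis = "Kessler's team interviewed more than 60,000 adults in 14 countries"
result = entails(text, hypothesis)
```

The first pattern stands in for the equivalence rule and the numeric comparison stands in for the "more than" implication rule.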

Combining Semi-Unsupervised Acquisition of Corpora and Supervised Learning of Textual Entailment Rules

Fabio Massimo Zanzotto, Marco Pennacchiotti, Alessandro Moschitti


Next...

P-14 Applying Automatically Generated Semantic Knowledge: A Case Study in Machine Translation

Nitin Madnani, Philip Resnik, Bonnie Dorr and Richard Schwartz
University of Maryland

On Deck: P-15 Justin Betteridge
Double Deck: P-16 Karin Verspoor


P-14

• No single correct answer for MT
• Need multiple correct (human) answers to tune MT system
• Expensive to have humans create multiple translations

This Leads To Reference Sparsity!

Automatic Paraphrasing as E-to-E translation

O: We must bear in mind the community as a whole.
P: We must remember the wider community.

O: France sent its proposal in the form of a “non-official paper”.
P: French transmits its recommendations to serve as a “non-official document”.

O: They should be better coordinated and more effective.
P: They should improve the coordination and efficacy.

O: Thirdly, the implications of enlargement for the union’s regional policy cannot be overlooked.
P: Finally, the impact of enlargement for EU regional policy cannot be ignored.

Artificial “Reference” Translations (O: original, P: our paraphrase)

Tuning Refs   Newswire BLEU   Newswire TER   Web BLEU   Web TER
1H            37.65           56.39          15.17      70.32
1H+1P         39.32           54.69          15.92      69.94

Significant improvements when using even a single additional artificial reference for tuning
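To see why an extra reference matters, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on, first against one human reference and then against the human reference plus a paraphrase. The candidate sentence is hypothetical, and the sketch omits BLEU's brevity penalty and the combination of several n-gram orders.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Candidate n-gram counts are clipped by the maximum count in any reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

human = "we must bear in mind the community as a whole".split()
paraphrase = "we must remember the wider community".split()
candidate = "we must remember the community".split()  # hypothetical MT output

precision_1h = clipped_precision(candidate, [human])
precision_1h_1p = clipped_precision(candidate, [human, paraphrase])
```

With only the human reference the word "remember" goes unrewarded; adding the paraphrase as a second reference gives the same candidate full unigram credit.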

Applying Automatically Generated Semantic Knowledge: A Case Study in Machine Translation

Nitin Madnani, Philip Resnik, Bonnie Dorr & Richard Schwartz


Next...

P-15 Continuous Discovery of Semantic Knowledge

Justin Betteridge, Andrew Carlson, Sue Ann Hong, Estevam R. Hruschka Jr., Edith L. M. Law, Tom M. Mitchell, and Sophie H. Wang
CMU

On Deck: P-16 Karin Verspoor
Double Deck: P-18 Svetlana Stoyanchev


P-15

Toward Continuous Discovery of Semantic Knowledge

Justin Betteridge, Andrew Carlson, Sue Ann Hong, Estevam R. Hruschka Jr., Edith L. M. Law, Tom M. Mitchell and Sophie H. Wang. Carnegie Mellon University

Goal: Never-ending language learning

SubGoal considered here:

• Achieving high semi-supervised learning accuracy by coupling the learning of many categories

• Domain: learning semantic classes of NPs

Coupling learning of functions f(x), g(x):

1. Propagate initial labeled examples of f(x) to g(x)

2. Propagate self-labeled examples

3. Use learned instances/patterns of f(x) to assess patterns/instances of g(x)

Multi-task learning with explicit relationships between learning tasks

• subset(organization(x), university(x))

• exclusive(university(x),person(x))

• inverse(parentOf(x,y),childOf(x,y))

• childOf(x,y) => person(x) ^ person(y)

…

Coupling   country   city   company   univ.   mean
1          93.6      99.1   100.0     79.1    93.0
1,2,3      89.1      98.2   100.0     97.3    96.2

Bootstrap learning accuracy: iteratively labeling 110 new examples from 8M web pages
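One way to picture how explicit relationships couple the learners is to filter each round of proposed instances through the constraints. The sketch below is an assumption-laden toy: it reads the subset relation as "every university is an organization", drops items claimed by two mutually exclusive categories, and uses made-up noun phrases.

```python
def couple(proposed, exclusive, subset):
    """Apply mutual-exclusion and subset constraints to proposed category instances.

    proposed:  dict mapping category -> set of noun phrases proposed this round
    exclusive: list of (category_a, category_b) pairs that cannot share an instance
    subset:    list of (sub_category, super_category) pairs
    """
    accepted = {cat: set(items) for cat, items in proposed.items()}
    # Mutual exclusion: an item claimed by both categories is dropped from both.
    for a, b in exclusive:
        conflict = accepted.get(a, set()) & accepted.get(b, set())
        accepted[a] = accepted.get(a, set()) - conflict
        accepted[b] = accepted.get(b, set()) - conflict
    # Subset: instances of the sub-category also count for the super-category.
    for sub, sup in subset:
        accepted.setdefault(sup, set()).update(accepted.get(sub, set()))
    return accepted

# Hypothetical proposals from two independently bootstrapped learners.
proposed = {
    "university": {"Carnegie Mellon University", "Tom Mitchell"},
    "person": {"Tom Mitchell"},
}
accepted = couple(
    proposed,
    exclusive=[("university", "person")],
    subset=[("university", "organization")],
)
```

Dropping a conflicting item from both categories is a conservative choice made for the sketch; a real system could instead keep the higher-confidence label.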


Next...

P-16 The Colorado OpenDMAP system: Building on Community Ontologies and a Community Platform for Biomedical Natural Language Processing

Karin Verspoor, William Baumgartner, Kevin Cohen, Helen Johnson, and Larry Hunter
University of Colorado Denver

On Deck: P-18 Svetlana Stoyanchev
Double Deck: P-19 Jordan Boyd-Graber


P-16

The Colorado OpenDMAP system

Karin Verspoor, William Baumgartner, K. Bretonnel Cohen, Helen Johnson, Larry Hunter

free text

Cyclin E2 interacts with Cdk2 in a functional kinase complex.

OpenDMAP (ontology + patterns)

extracted information

protein protein interaction:
interactor1: cyclin E2
interactor2: cdk2

PROTÉGÉ ONTOLOGY

CLASS: protein protein interaction
SLOT: interactor1 TYPE: molecule
SLOT: interactor2 TYPE: molecule

PATTERNS

{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…

An ontology-driven integrated concept recognition system with proven applicability to biomedical information extraction problems.
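The way a pattern with typed slots yields a filled frame can be imitated with regular expressions. This is an illustrative sketch only and does not use OpenDMAP or its API: the `MOLECULES` list is a hypothetical stand-in for the ontology's `TYPE: molecule` constraint, and the helper names are invented.

```python
import re

# Hypothetical lexicon standing in for fillers of TYPE: molecule.
MOLECULES = ["Cyclin E2", "Cdk2"]

PATTERNS = [
    "[interactor1] interacts with [interactor2]",
    "[interactor1] is bound by [interactor2]",
]

def compile_pattern(pattern, fillers):
    """Turn '[slot] literal text [slot]' into a regex with one named group per slot."""
    alternatives = "|".join(
        re.escape(f) for f in sorted(fillers, key=len, reverse=True)
    )
    parts = re.split(r"\[(\w+)\]", pattern)  # odd indices are slot names
    regex = "".join(
        f"(?P<{part}>{alternatives})" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(regex, re.IGNORECASE)

def extract(sentence, patterns, fillers):
    """Return the slot fillers of the first matching pattern, or None."""
    for pattern in patterns:
        match = compile_pattern(pattern, fillers).search(sentence)
        if match:
            return match.groupdict()
    return None

frame = extract(
    "Cyclin E2 interacts with Cdk2 in a functional kinase complex.",
    PATTERNS,
    MOLECULES,
)
```

Restricting slot fillers to known molecule names is what keeps the second slot from swallowing trailing words such as "in a functional kinase complex".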


Next...

P-18 Automatic Feature Discovery for Predicting Content of User Utterances in Dialogs

Svetlana Stoyanchev
SUNY, Stony Brook

On Deck: P-19 Jordan Boyd-Graber
Double Deck: P-20 Breck Baldwin


P-18

Predicting Content of User Utterances in Dialog

Svetlana Stoyanchev and Amanda Stent
SUNY, Stony Brook

Problem

Goal: build dialog systems that allow users to speak freely. Automatic speech recognition (ASR) is a big issue (typical ASR error rate in dialog ~30%)

Approach

Two-pass ASR approach:

1. Predict presence of task-relevant concepts in user utterances using:
1. lexical features recognized by the first-pass of the ASR
2. dialog history features
3. prosodic features from the user’s speech

2. Adapt language model to the predicted content

Result

We achieve statistically significant (but small) improvements in second-pass ASR accuracy for one dialog context; plan to expand to others

Today: Performance of different methods of choosing lexical features


Next...

P-19 Syntactic Topic Models

Jordan Boyd-Graber and David M. Blei
Princeton University

On Deck: P-20 Breck Baldwin
Double Deck: P-21 Cliff Joslyn


P-19

[Figure: overall graphical model with variables α, α_T, β, π_k, τ_k, θ_d, α_D, σ and plates ∞ and M.]

Syntactic Topic Models Jordan Boyd-Graber and David Blei

Princeton University

Documents are collections of parse trees.

[Figure: sentence graphical model; latent classes z1 to z5 over parse-tree words START, slept, they, ran.]

The latent class depends on the parent node and the document's topic distribution.

Syntactic Topic Models

Jordan Boyd-Graber and David M. Blei

Princeton University Department of Computer Science

{jbg,blei}@princeton.edu

Both syntactic models and topic models are active, fruitful areas of research. One captures local patterns, and the other captures trends across many documents. To illustrate these different but complementary views, consider the following incomplete sentence from a travel brochure, “In a week, you could go to ___.” A syntactic model such as the infinite tree with independent children [1] tells us what words could be an object of a preposition (e.g., “bed,” “school,” “debt”), and a topic model such as the hierarchical Dirichlet process (HDP) [5] could tell us what words fit with a travel theme (“vacation,” “relax,” “exotic,” etc.). In this work, we develop a model that can combine the constraints of both syntax and semantics to build categories of words that are consistent with both.

To do this, we build a model called the syntactic topic model (STM). Using a corpus composed of dependency parse trees collected into documents, the STM learns “topics” that are both thematically and syntactically consistent. These topics, like the parts of speech in syntactic models or the syntactically-uninformed topics in topic models, are distributions over the lexicon.

To incorporate syntax and semantics, the STM combines the per-document distributions over topics (as in topic models) with the part of speech transition probabilities (as in syntactic models). It does this by taking the point-wise product of these distributions and then selecting a latent class for each word from this new, renormalized distribution, similar to the product of experts model [2]. More formally, the full generative model of the corpus is:

1. Choose global topic weights β ∼ GEM(α)

2. For each topic index k = {1, . . . }:

(a) Choose topic τ_k ∼ Dir(σρ_u)

(b) Choose topic transition distribution π_k ∼ DP(α_T, β)

[Figure 1, panel (a): Overall Graphical Model. Panel (b): Sentence Graphical Model, with latent classes z1 to z9 over the words START, lay, phrase, in, some, mind, his, for, year.]

Figure 1: Graphical model for a syntactic topic model (left); in greater detail is the graphical model for each sentence (right).

[Figure 2: graph of learned topics linked by transition probabilities, starting from START. Topic nodes:]

• his, their, other, us, its, last, one, all
• policy, gorbachev, mikhail, leader, soviet, restructuring, software
• garden, visit, having, aid, prime, despite, minister, especially
• television, public, australia, cable, host, franchise, service
• says, could, can, did, do, may, does, say
• they, who, he, there, one, we, also, if
• mr, inc, co, president, corp, chairman, vice, analyst
• europe, eastern, protection, corp, poland, hungary, chapter, aid
• shares, quarter, market, sales, earnings, interest, months, yield

Figure 2: On hand-parsed documents, the STM discovered two categories of topics. Some topics (shaded with grey) were shared across almost all documents and filled the role of a generic part of speech, not reflecting any thematic specification. Other topics, however, are selected by a document’s semantic constraints.

3. For each document d = {1, . . . M}:

(a) Choose topic weights θ_d ∼ DP(α_D, β)

(b) For each sentence root node:

i. Choose topic assignment z_0 ∝ θ_d π_start

ii. Choose root word w_{d,0} ∼ mult(1, τ_{z_0})

(c) For each additional word w_{d,n} and parent p_n, n ∈ {1, . . . d_n}:

i. Choose topic assignment z_{d,n} ∝ θ_d π_{z_{p(d,n)}}

ii. Choose word w_{d,n} ∼ mult(1, τ_{z_{d,n}})
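The topic-assignment step, where a latent class is drawn in proportion to the point-wise product of the document's topic weights and the parent's transition distribution, can be sketched as below. The model itself is nonparametric; this toy truncates it to three classes, and the probability values and function names are assumptions made for illustration.

```python
import random

def latent_class_distribution(theta_d, pi_parent):
    """Point-wise product of two distributions over classes, renormalized to sum to 1."""
    product = [t * p for t, p in zip(theta_d, pi_parent)]
    total = sum(product)
    return [value / total for value in product]

def sample_class(distribution, rng):
    """Draw one class index according to the given distribution."""
    return rng.choices(range(len(distribution)), weights=distribution, k=1)[0]

# Hypothetical values for a model truncated to three latent classes.
theta_d = [0.5, 0.3, 0.2]    # document's topic weights
pi_parent = [0.1, 0.6, 0.3]  # transition distribution of the parent's class

dist = latent_class_distribution(theta_d, pi_parent)
z = sample_class(dist, random.Random(0))
```

A class receives high probability only when both the document and the syntactic context favor it, which is the product-of-experts behavior described in the text.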

To discover the best configuration of these unobserved variables in our generative process we use variational inference for nonparametric Bayesian models [3]. This process uncovers the best top-level weights, topic transitions, per-document topic distributions, topic assignments, and topics.

We fit the STM to the Penn Treebank [4]. Instead of grouping all nouns into a single topic, some parts of speech (such as nouns and adjectives) are divided into specialized syntactic groups that appear in similar documents (Figure 2), but other parts of speech such as verbs and prepositions are shared across many documents. Quantitatively, the STM also did better in predicting words on held-out data; its perplexity on held-out documents was better (lower) than the HDP or the infinite tree.

References

[1] J. R. Finkel, T. Grenager, and C. D. Manning. The infinite tree. In ACL, pages 272–279, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[2] G. Hinton. Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks, pages 1–6, Edinburgh, Scotland, 1999. IEEE.

[3] P. Liang, S. Petrov, M. Jordan, and D. Klein. The infinite PCFG using hierarchical Dirichlet processes. In HLT, pages 688–697, 2007.

[4] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330, 1994.

[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, December 2006.

Learned topics are consistent with both syntax and theme.

Poster 19


Next...

P-20 Is Semantics Just Picking the Right Syntax for the Context from Multiple Possibilities?

Breck Baldwin
Alias-i

On Deck: P-21 Cliff Joslyn
Double Deck: P-23 Eiman Tamah Al-Shammari


P-20



Next...

P-21 Semantic Hierarchies: Induction, Measurement, and Management

Cliff Joslyn, Michelle Gregory, Liam McGrath, Patrick Paulson, Karin Verspoor
Pacific Northwest National Laboratory, University of Colorado Denver

On Deck: P-23 Eiman Tamah Al-Shammari
Double Deck: P-24 Kimiaki Shirahama


P-21

Semantic Hierarchies: Induction, Measurement, and Management

Concept lattices

Semantic hierarchies from relational data

Semantic Hierarchies:
• Cores of ontologies
• 80-90% of links in real-world ontologies
• Becoming large: 10^4–10^6 nodes

Need for algorithms and measures
• Induction from text
• Visualization, annotation
• Alignment, matching

WordNet

Gene Ontology

Concept lattices from relational data
• Capture implication relations dually between objects, attributes
• Unbiased, graphical, visual representation

Mathematical order theory
• Metrics: Distances and similarities based on semi-modular valuation functions
• Ranks: Structure of vertical levels
• Morphisms: Mappings and linkages

Issues for knowledge systems
• Enable robust use of multiple inheritance: beyond trees!
• Avoid risks of pure graph theory, path-counting methods
• Proper use of vertical levels


Next...

P-23 Syntactical Knowledge usage to Reduce Arabic/English Stemming Errors

Eiman Tamah Al-Shammari
Kuwait University, George Mason University

On Deck: P-24 Kimiaki Shirahama
Double Deck: P-25 Kazuhiro Seki


P-23



Next...

P-24 Characteristics of Textual Information in Video Data from the Perspective of Natural Language Processing

Kimiaki Shirahama, Akihito Mizui and Kuniaki Uehara
Kobe University

On Deck: P-25 Kazuhiro Seki
Double Deck: P-28 Marine Carpuat


P-24

Characteristic of Textual Information in Video Data from the Perspective of Natural Language Processing

Purpose

Topic detection in videos using utterances obtained by ASR method (ASR transcripts)
→ Efficient search and browsing of a video archive

Video

Audio

Text document: All the semantic contents are conveyed only through a text medium.

Video: Semantic contents are conveyed through synchronized video and audio media in a complementary manner. Synergy between video and audio media.

Pattern of word occurrences

Preliminary examination of whether NLP methods can appropriately process ASR transcripts

Trigger pair extraction → NLP methods cannot treat temporal distributions of spoken words.
Topic extraction by LDA → The same words are commonly spoken in different words.
The same word is not spoken so many times. → Burst detection based on character’s appearance

Trigger pair….. President Kennedy had embarked on a tour of Texas in an effort to raise campaign funds and to unite party members. The President, accompanied by Vice-President Lyndon B. Johnson, Texas Governor John Connally, ….. The motorcade started a few minutes late but managed to proceed close to its schedule. The crowds were exuberant, encroaching on every vantage point along the route. ….. Incidents such as that, the clearing weather, the bright warm sun, and the tremendous and loudly cheering crowds were exactly what the president needed. ….. The Kennedy magic was at its best. Then, more than halfway along the route through Dallas, and just as the motorcade broke through the heaviest street crowds, ….. Shots echoed through Dealey Plaza. President Kennedy was mortally wounded, Governor Connally was seriously wounded. …..

Burst

time

Oh my god!

President, please shake with me.

Kennedy, it’s fine today.

Yeah. Sure.

I’m happy many people come to this campaign at Dallas.

Oh no! Jesus Christ!

Discuss how to improve NLP methods for ASR transcript processing!


Next...

P-25 Biomedical Association Discovery via Complementary TDM

Kazuhiro Seki and Kuniaki Uehara
Kobe University

On Deck: P-28 Marine Carpuat
Double Deck: P-29 Rion Snow


P-25

K. Seki & K. Uehara at Kobe University (P-25)

Text data mining (TDM)

Explicit information Implicit information

IR, IE, Classification, Summarization, etc.

hypothesis discovery or literature-based discovery

Gene Ontology annotation

Genetic association discovery

[Figure labels:]

Genes: g1, g2, …, gl

article

Phenotypes: p1, p2, …, pn

Gene functions: f1, f2, f3, …, fm-1, fm

GO: CC, BP, MF

Disease d

negative

positive

annotation

Repeat for each gene


Next...

P-28 Word Sense Disambiguation for Statistical Machine Translation

Marine Carpuat
Columbia University

On Deck: P-29 Rion Snow
Double Deck: P-30 Delip Rao


P-28

Word Sense Disambiguation for Statistical Machine Translation

Marine Carpuat
Columbia University Center for Computational Learning Systems

• Most SMT systems do not explicitly use WSD
– static translation probabilities, not sensitive to context

• But using WSD for SMT first gave confusing results
– WSD for SMT hurts BLEU score!? [Carpuat & Wu ACL-2005]

• But WSD should help SMT… [Carpuat & Wu IJCNLP-05]

• Generalizing WSD to Phrase Sense Disambiguation for SMT [Carpuat & Wu, 2007]
– PSD is fully phrasal just like conventional SMT lexicons
– PSD predictions are fully integrated in SMT decoding
– PSD models are trained on the same parallel data as SMT lexicons

PSD improves translation quality consistently on 8 metrics and 4 tasks


Next...

P-29 Crowdsourcing Annotations for Natural Language Tasks: An Evaluation

Rion Snow
Stanford University

On Deck: P-30 Delip Rao
Double Deck: P-31 James Mayfield


P-29

Crowdsourcing Annotations for Natural Language Tasks: An Evaluation

• What would you do if you had an on-demand army of thousands of annotators?

• 10,000 labels / day

• 1,000 labels / dollar

• Expert-quality labeling or better (with some tricks)

• Results on five natural language tasks

Cheap and Fast - But is it Good? Snow et al., EMNLP-2008
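The slide does not spell out the "tricks". One common way to combine several inexpensive labels per item is a simple majority vote, sketched below as an assumption rather than as the method from the poster; the cost helper only restates the slide's rate of 1,000 labels per dollar.

```python
from collections import Counter

LABELS_PER_DOLLAR = 1000  # rate quoted on the slide

def majority_vote(labels):
    """Most frequent label; ties are broken alphabetically so the result is deterministic."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)[0]

def labeling_cost(num_items, labels_per_item):
    """Dollar cost of collecting redundant labels at the quoted rate."""
    return num_items * labels_per_item / LABELS_PER_DOLLAR

# Hypothetical annotations: three non-expert labels for each of two items.
annotations = {
    "item-1": ["positive", "negative", "positive"],
    "item-2": ["negative", "negative", "positive"],
}
aggregated = {item: majority_vote(votes) for item, votes in annotations.items()}
cost = labeling_cost(num_items=2000, labels_per_item=5)
```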


Next...

P-30 Bootstrapping Extraction Patterns from Wikipedia

Delip Rao
JHU

On Deck: P-31 James Mayfield
Double Deck: D-1 Daniel Tunkelang


P-30



Next...

P-31 Knowledge Base Evaluation for Semantic Knowledge Discovery

James Mayfield, Bonnie Dorr, Tim Finin, Douglas Oard and Christine Piatko
Human Language Technology Center of Excellence

On Deck: D-1 Daniel Tunkelang
Double Deck: D-3 David Nadeau


P-31

Mayfield, Dorr, Finin, Oard, Piatko NYU Symposium on Semantic Knowledge Discovery, Organization and Use

Knowledge Base Evaluation for Semantic Knowledge Discovery

• Key idea: evaluate knowledge base, not extraction output
• Six evaluation axes
– Accuracy
– Usefulness
– Augmentation
– Explanation
– Adaptation
– Temporal Qualification

• This approach has many advantages!

KB
Structured Knowledge

Entities

Events

Relations

PERSON
Ali Hassan al-Majid
علي حسن عبد المجيد التكريتي
DOB: 1941
Citizenship: Iraq
Position: Defense Minister

ORGANIZATION
Jihaz al-Mukhabarat al-Amma
AKA: Jihaz al-Khas
Country: Iraq

Evaluate


Next...

D-1 Unsupervised Annotation and Exploratory Search

Daniel Tunkelang
Endeca

On Deck: D-3 David Nadeau
Double Deck: D-4 Gregory Marton


D-1


Next...

D-3 Demo of Semi-Supervised Named Entity Recognition at OpenPlaces

David Nadeau
Openplaces

On Deck: D-4 Gregory Marton
Double Deck: D-5 Mona Diab


D-3

openplaces™

Semi-supervised Named Entity Recognition

- Minimal human input

- Web page wrapper induction

Travel ontology

- 1 trillion ‘relations'

Semantic Search Engine for the Travel domain


Next...

D-4 Procedure Discovery for Time Expression Understanding

Gregory Marton
MIT

On Deck: D-5 Mona Diab
Double Deck: D-6 Michael Paul


D-4

Procedure Discovery for Time Expression Understanding

Gregory Marton
[email protected]

Existing Lexicon

"tomorrow" : (λ.t (.add t 1 'day))

"May Day" : (λ.t (.near t #:month 5 #:day 1))

"Thrusday" : (λ.t (.near t #:day-of-week 4))

...
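The lexicon entries above pair a phrase with a small procedure over a reference time t. A rough Python analogue is sketched below; the reading of `.near` as "the closest matching date to t", the ISO weekday numbering (1 = Monday, so 4 = Thursday), and all function names are assumptions made for this example.

```python
from datetime import date, timedelta

def add_days(n):
    """Analogue of (.add t n 'day)."""
    return lambda t: t + timedelta(days=n)

def near_month_day(month, day):
    """Assumed reading of (.near t #:month m #:day d): closest such date to t."""
    def resolve(t):
        candidates = [date(t.year + offset, month, day) for offset in (-1, 0, 1)]
        return min(candidates, key=lambda d: abs((d - t).days))
    return resolve

def near_weekday(iso_weekday):
    """Assumed reading of (.near t #:day-of-week n): closest date with that weekday."""
    def resolve(t):
        window = [t + timedelta(days=k) for k in range(-3, 4)]  # 7 consecutive days
        return next(d for d in window if d.isoweekday() == iso_weekday)
    return resolve

LEXICON = {
    "tomorrow": add_days(1),
    "May Day": near_month_day(5, 1),
    "Earth Day": near_month_day(4, 22),
    "Thursday": near_weekday(4),
    "9/11": lambda t: date(2001, 9, 11),  # fixed value, ignores t
}

reference = date(2003, 4, 20)  # hypothetical document date
earth_day = LEXICON["Earth Day"](reference)
tomorrow = LEXICON["tomorrow"](reference)
thursday = LEXICON["Thursday"](reference)
```

Learning the semantics of an unseen word such as a misspelled weekday then amounts to choosing which existing procedure, or which parameters for it, best explains the annotated values.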

Learned Semantics

Unseen Word

"World AIDS Day"
"Veterans Day"
"May Day"
"Thanksgiving"
...

"Thrusday" "Earth Day"

"Thursday"
"Thruway"
"Tuesday"
...

Source Semantics

Distributionally Similar Words

(λ.t (.near t #:month 11 #:day 11))

"Earth Day" : (λ.t (.near t #:month 4 #:day 22))

(λ.t (.near t #:month 11 #:day-of-week 4 #:nth 4))

VAL="2003-04-22"

VAL="2001-09-11"

(λ.t (.near t #:month 5 #:day 1))

Unseen Meaning"9/11"

VAL="2003-04-22"

"9/11" : (λ.t (.set-value t "2001-09-11"))

"Thursday" : (λ.t (.near t #:day-of-week 4))


Next...

D-5 SALAMCAT: Sense Assignment Leveraging Alignments, Monolingual Contexts And Translations

Mona Diab and Weiwei Guo
Columbia University

On Deck: D-6 Michael Paul
Double Deck: D-7 Emily Jamison


D-5



Next...

D-6 AIRTA: An Automatic Inter-disciplinary Research Topic Advisor - Where are We and Where do We Go -

Michael Paul and Roxana Girju
University of Illinois at Urbana-Champaign

On Deck: D-7 Emily Jamison
Double Deck: D-8 Toru Hirano


D-6

Michael Paul† and Roxana Girju‡
Departments of Computer Science († ‡) and Linguistics (‡), Beckman Institute († ‡)
University of Illinois at Urbana-Champaign
{mjpaul2, girju}@illinois.edu

Introduction

We believe that like other disciplines, computational linguistics will drastically benefit from an inter-disciplinary perspective.

Our tool is designed to foster interdisciplinary research in order to make breakthrough predictions for future directions.

This is accomplished by analysing trends within and across relevant fields and then automatically suggesting new research directions and topics.

Some fields motivating research in computational linguistics

Trends/Analysis

Because our data is categorized and labelled by year, we can see how research in certain fields rises and declines over time.

We can use this information to gauge which topics are important and which areas are saturated.

We also look for correlations in trends in similar fields across different disciplines.

Next Step

The next phase of this project (the final goal) will be to generate new topics. The key is to discover topics that are important in one discipline but have been studied little in another. These suggestions will be useful to professionals who would like to engage in research discussions with other parties, but who are not familiar with those areas. It will be beneficial to students looking for novel research topics.

Back End

We currently have a database with:

4,700 papers from computational linguistics conferences

2,300 papers from linguistics journals

1,700 papers from education/educational psychology journals

We will enlarge our corpus as we continue to work on this project.

Classification

We categorized these papers mostly using Latent Dirichlet Allocation (LDA) with words from titles, abstracts, and full text when available.

AIRTA: An Automatic Interdisciplinary Research Topic Advisor - Where are We and Where do We Go -

Industry

Linguistics

Machine Learning

Cognitive Psychology

Education

A sample of categories and the top keywords associated with them

Lexical Semantics: lexical, entries, semantic, word, idioms, words, lexicon
Morphology: morphological, word, morphology, lexical, level, forms
MT Evaluation: evaluation, score, human, scores, sentence, automatic
Multimodal NLP: multimodal, speech, gesture, user, language, input
Named Entities: entity, names, named, entities, ne, information, person

Each dot represents a paper in the “Dialogue Systems” category. The coloring shows how papers can span multiple categories.

Language-related topics comprise the bulk of research in education and are steadily increasing in prominence.


Next...

D-7 CACTUS: A User-friendly Toolkit for Semantic Categorization and Clustering in the Open Domain

Emily Jamison
The Ohio State University

On Deck: D-8 Toru Hirano


D-7

CACTUS: A User-friendly Toolkit for Semantic Categorization and Clustering in the Open Domain

Emily K. Jamison

CACTUS: A 1-Slide Introduction

Open-domain

No Training Required

Easy-to-use GUI

Or, command-line interface

Near-universal coverage

Internet as Knowledge Source


Next...

D-8 Aggregating Knowledge of Named Entity Relations

Toru Hirano, Yoshihiro Matsuo, and Genichiro Kikui
NTT Cyber Space Laboratories


D-8

D-8: Aggregating Knowledge of Named Entity Relations

Web

“George Bush is the President of the U.S”

Extractor

[ President, George Bush, the U.S. ]

Wikipedia

Geographic database

Relational Database

NE1:String    NE1:ID               Relationship   NE2:String   NE2:ID
Bush          George W. Bush-001   Speech         NY           New York City-010
George Bush   George W. Bush-001   President      the U.S.     United States of America-001

8 million records from 14 million web pages
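The aggregation step, in which different surface strings are resolved to the same entity ID before records are merged, can be sketched with a plain alias table. The alias entries mirror the strings and IDs shown on the slide; the function name, the second "President" triple, and the dictionary-based lookup are assumptions, since the poster resolves names using Wikipedia and a geographic database.

```python
# Alias table: surface string -> entity ID (values taken from the slide's table).
ALIASES = {
    "George Bush": "George W. Bush-001",
    "Bush": "George W. Bush-001",
    "the U.S.": "United States of America-001",
    "NY": "New York City-010",
}

def aggregate(triples, aliases):
    """Group extracted [relation, NE1, NE2] triples by their resolved entity IDs."""
    records = {}
    for relation, ne1, ne2 in triples:
        key = (aliases.get(ne1, ne1), relation, aliases.get(ne2, ne2))
        records.setdefault(key, []).append((ne1, ne2))
    return records

# Extracted triples; the second one is a hypothetical variant of the first.
triples = [
    ("President", "George Bush", "the U.S."),
    ("President", "Bush", "the U.S."),
    ("Speech", "Bush", "NY"),
]
records = aggregate(triples, ALIASES)
president_key = ("George W. Bush-001", "President", "United States of America-001")
```

After resolution, the two "President" mentions collapse into one record keyed by IDs, while the original strings are kept as supporting evidence.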