Semantic Knowledge Discovery, Organization and Use
Warren Weaver Hall, New York University
November 14 and 15, 2008
NSF Sponsored Symposium
P-1
The Development of a Shared Dataset for Predictive Analysis in the Behavioral Sciences
Kai R. Larsen, Jintae Lee, and Eliot Rich, U. Colorado
[Figure: chart of known and unknown relationships (in thousands); vertical axis from 0 to 1,400,000]
Setting: A large portion of behavioral
research focuses on very distinct
knowledge constructs and their
relationships.
Problem: For every behavioral paper
published…
• known relationships increase linearly
• unknown relationships increase
exponentially
Solution:
• Collect large dataset of behavioral
constructs and their relationships
• Make this available to the community of
knowledge discovery researchers
• By automatically figuring out 1% of the
relationships, a researcher could
contribute more to science than 100,000
behavioral researchers could do in their
lifetimes.
P-2
Discovering Semantic Relations Using Singular Value Decomposition Based Techniques
Catherine Havasi, Brandeis University
Acquiring and Using Common Sense
• Acquire common sense
– From human volunteers
– From inference
– From corpora
• Using dimensionality reduction
– To learn more common sense
– To add common sense intuition to domain-specific data
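The dimensionality-reduction step can be illustrated with a truncated singular value decomposition. This is a minimal sketch, not the presenter's implementation: the toy concept/feature matrix, the names `concepts`, `features`, and `truncated_svd`, and the choice of rank 2 are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: rows are concepts, columns are features.
# A 1.0 means the assertion was collected (e.g. from a volunteer).
concepts = ["dog", "cat", "car"]
features = ["IsA animal", "has fur", "can bark", "has wheels"]
A = np.array([
    [1.0, 1.0, 1.0, 0.0],  # dog
    [1.0, 1.0, 0.0, 0.0],  # cat
    [0.0, 0.0, 0.0, 1.0],  # car
])


def truncated_svd(matrix, k):
    """Return the rank-k reconstruction built from the top k singular triplets."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


A_k = truncated_svd(A, k=2)

# Cells that were never asserted but receive a comparatively high smoothed
# score are candidates for new common-sense assertions.
cat_bark = A_k[concepts.index("cat"), features.index("can bark")]
car_bark = A_k[concepts.index("car"), features.index("can bark")]
```

In this toy matrix the smoothed score for "cat / can bark" is clearly positive because cats share features with dogs, while "car / can bark" stays near zero. Whether such a generalization is correct still has to be checked, which is why the scores are best read as suggestions.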
P-3
Putting the "Wisdom-of-Crowds" to Use in NLP: Collaboratively Constructed Semantic Resources on the Web
Iryna Gurevych, Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Germany
• Information Extraction (Ruiz-Casado et al., 2005)
• Information Retrieval (Gurevych et al., 2007)
• Named Entity Recognition (Bunescu & Pasca, 2006)
• Question Answering (Ahn et al., 2004)
• Text Categorization (Gabrilovich & Markovitch, 2006)
• Semantic Relatedness (Zesch et al., 2008)
• Information Retrieval (Müller and Gurevych, 2008)
Resources: Wikipedia, Wiktionary, WordNet, GermaNet, ...
APIs: JWPL, JWKTL, JWNL, GN API, ...
Unified access (via mapping) for applications: Information Extraction, Information Retrieval, Lexical Chains, Lexical Graphs, Named Entity Recognition, Question Answering, Semantic Relatedness, Text Categorization, Text Segmentation, Text Summarization, Word Sense Disambiguation, ...
Explicit lexical-semantic relations
• Entity: Part of Speech; Lexeme / Sense pairs
• Lexical Relations: Synonymy, Antonymy
• Semantic Relations: Hypernymy, Hyponymy, …
Abbreviations, Antonyms, Categories, Collocations, Derived Terms, Etymology, Examples, Glosses, Hypernyms, Hyponyms, Morphology, Part-of-speech, Pronunciation, Quotations, Related terms, Synonyms, Translations, Troponyms, Word senses
Advantages of collaborative construction
• Big
• Multi-lingual
• Cheap
• Up-to-date
Open issues with "the user-contributed information"
• incompleteness of information
• inconsistent structure of entries
• uneven coverage
• vagueness of concepts
• insufficient quality of information
This work is funded by the German Research Foundation (DFG GU 798/1‐2, 798/1‐3, and 798/3‐1) and the Volkswagen‐Foundation (I/82806)
Wikipedia & Wiktionary API http://www.ukp.tu-darmstadt.de/software
P-4
Efficient Semantic Inference over Language Expressions
Roy Bar-Haim, Bar-Ilan University
P-5
Length-independent vector-space document similarity measures
Derrick Higgins, Educational Testing Service (ETS)
The Problem: Similarity and Length
The similarity between two texts, as estimated by vector-based methods (CVA, LSA, RI, ...), depends not only on their congruence of meaning, but also on the lengths of the texts compared.
• Documents on similar topics will converge to similar representation vectors as their length increases.
• Longer documents are more likely to appear similar than shorter ones.
• Even documents on different topics may exhibit some increase in similarity scores with increasing length.
[Figure: two plots of similarity against gTypes (500 to 2000), one for CVA similarity and one for RI similarity, each on a scale from −0.2 to 1.0]
One simple way to remove the effect of text length is to subtract an estimate of similarity based on length, leaving the residual.
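The residual idea can be sketched as follows. The poster does not say how the length-based estimate is obtained, so the least-squares straight-line fit used here is an assumption, as are the function name `length_residuals` and the sample numbers.

```python
import numpy as np


def length_residuals(similarities, lengths):
    """Subtract a linear, length-based estimate of similarity from each score.

    The estimate is an ordinary least-squares line fitted to the
    (length, similarity) pairs; what is returned is the part of each
    score that the line does not explain.
    """
    sims = np.asarray(similarities, dtype=float)
    lens = np.asarray(lengths, dtype=float)
    slope, intercept = np.polyfit(lens, sims, deg=1)
    predicted = slope * lens + intercept
    return sims - predicted


# Hypothetical scores for four document pairs of increasing length.
lengths = [500, 1000, 1500, 2000]
similarities = [0.30, 0.45, 0.62, 0.75]
residuals = length_residuals(similarities, lengths)
```

By construction the residuals of a least-squares fit with an intercept sum to zero and are uncorrelated with length, so what remains is the variation in similarity that length alone does not predict.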
P-6
Semantic Reclassification of Ontological Concepts using Contextual and Lexical Features
Jung-Wei Fan, MS, MPhil, and Carol Friedman, PhD
Department of Biomedical Informatics, Columbia University, New York
Example: The problem
Semantic Type: Finding
• Disease-related concepts: Progressive renal failure, Hyperkalemia, etc.
• Function-related concepts: Nitrogen balance, Mitotic activity, etc.
• Procedure-related concepts: Appendico-vesicostomy, etc.
• General finding concepts: Unemployment, Beer drinker, etc.
Methods
[Diagram, training phase: training corpus, training lexicon, contexts for the classes, bag of words for the classes, distributional classifier, Naïve Bayes classifier]
[Diagram, classifying phase: Hyperkalemia is classified as Disorder]
P-7
Semantic Knowledge Discovery, Organization and Use: Some Ongoing Research at Boeing
Peter Clark and Phil Harrison, Boeing Phantom Works
1. Developing WordNet (with Princeton and ISI)
– 30,000 additional links, glosses in logic, core theories
2. Extracting Commonsense Knowledge from Text
– database of 55 million Schubert-style "tuples"
– e.g., "planes can be bought", "pilots can fly to places", …
3. Recognizing Textual Entailment
– use of world knowledge, using WordNet and DIRT
– logical reasoning and explainable decisions
4. Machine Reading
– integration of semantic representations from multiple texts
P-8
ShopSmart: Product Recommendations through Technical Specifications and User Reviews
Alexander Yates, Temple University
ShopSmart: Making Recommendations based on Technical Specifications and User Feedback
Alexander Yates (1), James Joseph (1), Ana-Maria Popescu (2)
(1) Computer and Information Sciences, Temple University, Philadelphia, PA, USA
(2) Yahoo! Labs, Santa Clara, CA, USA
P-9
Social Network Mining from the Web
Yutaka Matsuo, Danushka Bollegala, Hironori Tomobe, YingZi Jin, Junichiro Mori, Keigo Watanabe, Taiki Honma, Masahiro Hamasaki, Kotaro Nakayama, and Mizuki Oka
University of Tokyo, Japan
Our solution: POLYPHONET
Network View
At a conference: “Nice to meet you” and ... ?
Who is he?
Who are his colleagues?
What is he presenting?
What are his publications?
How is he connected with his colleagues?
P-10
Geographic Information Retrieval against Immediate Surroundings
Hiroyuki Toda, Norihito Yasuda, Yumiko Matsuura, and Ryoji Kataoka
NTT Cyber Solutions Laboratories
• What is Geographic Information Retrieval (GIR)?
– A document retrieval method that uses a content query (keywords) and a geographic query.
– It utilizes the geographic expressions in each document.
• Goal of our GIR:
– Realize searches for spots or services in our immediate surroundings via mobile communication devices.
• Problems and our propositions:
– Ranking: estimate the relevance of each document to the geo-query and prioritize the documents describing restricted areas related to the geo-query.
=> A ranking method that considers the extents implied by place names.
– Result representation: present the search results with the geo-constraints in mind, so that users can easily decide whether to read a document even when the screen size is restricted.
=> Query-biased summarization, which utilizes place name expressions related to the geo-query, for GIR result snippets.
P-11
Toward Automatic Compilation of Phrasal Thesaurus
Atsushi Fujita and Satoshi Sato, Nagoya University
Phrasal thesaurus: beyond word-based semantic computing
Deals with various phrasal paraphrases
Atsushi Fujita and Satoshi Sato (Nagoya Univ., Japan)
[Diagram labels: Productive; Non-productive; Generate!!; Collect!!]
Example paraphrase pairs:
• X wrote Y / X is the author of Y
• X solves Y / X deals with Y
• X show a A Y / X v(Y) adv(A)
• X V Y / X's V-ing of Y
• X V Y / Y be V-PP by X
• burst into tears / cried
• comfort / console
P-12
Towards Antonymy-Aware Natural Language Applications
Saif Mohammad, Bonnie Dorr, and Graeme Hirst
University of Maryland, University of Toronto
• Scope
– Clear opposites: wet-dry, promoted-demoted
– Contrasting word pairs: cold-warm, promoted-censured
• Method:
– Identify contrasting word pairs using seed antonym pairs and thesaurus categories.
– Determine degree of antonymy using distributional distance and tendency to co-occur.
• Evaluation:
– 950 GRE-style closest-opposite questions.
• Results:
– F score = .70 (baselines: .20 and .22).
• Applications:
– detecting incompatibles (contradictions, sentiment), generating paraphrases, detecting humor, improving distributional thesauri.
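One common way to quantify a "tendency to co-occur" is pointwise mutual information (PMI). The poster does not name the statistic it uses, so PMI is an assumption here, and the function `pmi_table` and the four-sentence corpus are purely illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """PMI (base 2) of unordered word pairs, counted once per sentence."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    table = {}
    for pair, joint in pair_counts.items():
        a, b = tuple(pair)
        p_joint = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[pair] = math.log2(p_joint / (p_a * p_b))
    return table


# Hypothetical corpus; real counts would come from a large text collection.
corpus = [
    "the towel was wet not dry",
    "wet or dry the road is open",
    "the road is long",
    "a dry summer",
]
scores = pmi_table(corpus)
wet_dry = scores[frozenset(("wet", "dry"))]
```

A positive score means the pair shares sentences more often than chance would predict. In the method outlined above, such a signal would be combined with distributional distance rather than used on its own.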
P-13
Combining Semi-Unsupervised Acquisition of Corpora and Supervised Learning of Textual Entailment Rules
Fabio Massimo Zanzotto, University of Rome "Tor Vergata", Italy
The Problem: To determine if
"Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries"
entails
"Kessler's team interviewed more than 60,000 adults in 14 countries"
we need
• the equivalence between "X conducted Y interviews with Z" and "X interviewed Y Z"
• the implication rule that says "X" implies "more than Y" if "X is bigger than Y"
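The second rule is simple enough to write down directly. This is an illustrative sketch only: the helper names `parse_number` and `entails_more_than` are hypothetical, and it handles nothing beyond plain integers with optional thousands separators.

```python
import re


def parse_number(text):
    """Parse an integer with optional thousands separators, e.g. '60,643'."""
    return int(text.replace(",", ""))


def entails_more_than(premise_quantity, hypothesis_phrase):
    """Return True if quantity X licenses the phrase 'more than Y', i.e. X > Y."""
    match = re.fullmatch(r"more than ([\d,]+)", hypothesis_phrase.strip())
    if match is None:
        return False
    return parse_number(premise_quantity) > parse_number(match.group(1))


holds = entails_more_than("60,643", "more than 60,000")
fails = entails_more_than("60,643", "more than 70,000")
```

Rules of the first kind, the paraphrase between "conducted interviews with" and "interviewed", cannot be written by hand at scale, which is the motivation for learning them from corpora.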
Fabio Massimo Zanzotto, Marco Pennacchiotti, Alessandro Moschitti
P-14
Applying Automatically Generated Semantic Knowledge: A Case Study in Machine Translation
Nitin Madnani, Philip Resnik, Bonnie Dorr, and Richard Schwartz, University of Maryland
• No single correct answer for MT
• Need multiple correct (human) answers to tune MT system
• Expensive to have humans create multiple translations
This Leads To Reference Sparsity!
Automatic Paraphrasing as E-to-E translation
Artificial "Reference" Translations (O: original, P: our paraphrase)
O: We must bear in mind the community as a whole.
P: We must remember the wider community.
O: France sent its proposal in the form of a "non-official paper".
P: French transmits its recommendations to serve as a "non-official document".
O: They should be better coordinated and more effective.
P: They should improve the coordination and efficacy.
O: Thirdly, the implications of enlargement for the union's regional policy cannot be overlooked.
P: Finally, the impact of enlargement for EU regional policy cannot be ignored.
Tuning Refs   Newswire BLEU   Newswire TER   Web BLEU   Web TER
1H            37.65           56.39          15.17      70.32
1H+1P         39.32           54.69          15.92      69.94
Significant improvements when using even a single additional artificial reference for tuning
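Why an extra reference helps can be seen in the clipped unigram precision that underlies BLEU. This sketch covers only that one component: full BLEU also uses higher-order n-grams and a brevity penalty. The function name and the hypothesis sentence are illustrative assumptions; the two references are the first O/P pair above with punctuation removed.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis, references):
    """Fraction of hypothesis words matched in the references.

    Each word is credited at most as many times as it occurs in the
    single reference where it is most frequent (the "clipping" rule).
    """
    hyp_counts = Counter(hypothesis.lower().split())
    max_ref_counts = Counter()
    for reference in references:
        for word, count in Counter(reference.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word])
                  for word, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


human_ref = "we must bear in mind the community as a whole"
paraphrase_ref = "we must remember the wider community"
hypothesis = "we must remember the community"  # hypothetical system output

one_ref = clipped_unigram_precision(hypothesis, [human_ref])
two_refs = clipped_unigram_precision(hypothesis, [human_ref, paraphrase_ref])
```

With only the human reference the word "remember" goes unmatched, so a perfectly acceptable translation is penalized. Adding the paraphrase as a second reference removes that penalty.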
P-15
Toward Continuous Discovery of Semantic Knowledge
Justin Betteridge, Andrew Carlson, Sue Ann Hong, Estevam R. Hruschka Jr., Edith L. M. Law, Tom M. Mitchell, and Sophie H. Wang
Carnegie Mellon University
Goal: Never-ending language learning
SubGoal considered here:
• Achieving high semi-supervised learning accuracy by coupling the learning of many categories
• Domain: learning semantic classes of NPs
Multi-task learning with explicit relationships between learning tasks
• subset(organization(x), university(x))
• exclusive(university(x), person(x))
• inverse(parentOf(x,y), childOf(x,y))
• childOf(x,y) => person(x) ^ person(y)
…
Coupling learning of functions f(x), g(x):
1. Propagate initial labeled examples of f(x) to g(x)
2. Propagate self-labeled examples
3. Use learned instances/patterns of f(x) to assess patterns/instances of g(x)
Bootstrap learning accuracy: iteratively labeling 110 new examples from 8M web pages
Coupling   country   city   company   univ.   mean
1          93.6      99.1   100.0     79.1    93.0
1,2,3      89.1      98.2   100.0     97.3    96.2
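The effect of coupling constraints on candidate labels can be sketched as a filtering step. This is not the authors' algorithm. The function `apply_coupling`, the policy of withholding a noun phrase from both categories when they conflict, and the example noun phrases are all assumptions made for illustration.

```python
def apply_coupling(candidates, exclusive_pairs, subset_pairs):
    """Filter and extend candidate instances using coupling constraints.

    candidates: dict mapping a category name to the set of noun phrases
    its learner currently proposes.
    exclusive_pairs: (a, b) pairs meaning no phrase may belong to both.
    subset_pairs: (parent, child) pairs meaning every child instance is
    also a parent instance.
    """
    accepted = {category: set(nps) for category, nps in candidates.items()}
    # exclusive(a, b): withhold any phrase proposed for both categories.
    for a, b in exclusive_pairs:
        conflict = accepted[a] & accepted[b]
        accepted[a] -= conflict
        accepted[b] -= conflict
    # subset(parent, child): propagate accepted child instances upward.
    for parent, child in subset_pairs:
        accepted[parent] |= accepted[child]
    return accepted


# Hypothetical learner output before coupling.
candidates = {
    "university": {"CMU", "Tom Mitchell"},
    "person": {"Tom Mitchell"},
    "organization": {"Boeing"},
}
result = apply_coupling(
    candidates,
    exclusive_pairs=[("university", "person")],
    subset_pairs=[("organization", "university")],
)
```

Here the mutual-exclusion constraint stops "Tom Mitchell" from being promoted as a university, and the subset constraint adds "CMU" to the organizations. A real system would resolve the conflict using confidence scores rather than discarding the phrase from both categories.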
P-16
The Colorado OpenDMAP system: Building on Community Ontologies and a Community Platform for Biomedical Natural Language Processing
Karin Verspoor, William Baumgartner, K. Bretonnel Cohen, Helen Johnson, and Larry Hunter
University of Colorado Denver
An ontology-driven integrated concept recognition system with proven applicability to biomedical information extraction problems.
Free text:
Cyclin E2 interacts with Cdk2 in a functional kinase complex.
PROTÉGÉ ONTOLOGY
CLASS: protein protein interaction
SLOT: interactor1 TYPE: molecule
SLOT: interactor2 TYPE: molecule
PATTERNS
{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…
Extracted information (OpenDMAP applies the ontology and patterns to the free text):
protein protein interaction:
interactor1: cyclin E2
interactor2: cdk2
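The way a slot pattern turns free text into a filled frame can be imitated with regular expressions. This sketch does not use OpenDMAP or its API. The `MOLECULES` lexicon is a hypothetical stand-in for the ontology's "TYPE: molecule" constraint, and `compile_pattern` and `extract` are illustrative names.

```python
import re

# Hypothetical lexicon standing in for the constraint "TYPE: molecule".
MOLECULES = ["Cyclin E2", "Cdk2"]

# Right-hand sides of the two {c-interact} patterns shown on the poster.
PATTERNS = [
    "[interactor1] interacts with [interactor2]",
    "[interactor1] is bound by [interactor2]",
]


def compile_pattern(pattern):
    """Turn '[slot] literal text [slot]' into a regex with one named group per slot."""
    filler = "|".join(re.escape(name) for name in MOLECULES)
    pieces = re.split(r"\[(\w+)\]", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 1:  # odd positions hold the captured slot names
            regex += f"(?P<{piece}>{filler})"
        else:
            regex += re.escape(piece)
    return re.compile(regex, re.IGNORECASE)


COMPILED = [compile_pattern(p) for p in PATTERNS]


def extract(sentence):
    """Return the slot fillers of the first matching pattern, or None."""
    for compiled in COMPILED:
        match = compiled.search(sentence)
        if match:
            return match.groupdict()
    return None


frame = extract("Cyclin E2 interacts with Cdk2 in a functional kinase complex.")
```

Restricting slot fillers to known molecule names is what keeps the second slot from swallowing the following words. In the real system that role is played by ontology-typed concept recognition rather than a fixed list.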
P-18
Automatic Feature Discovery for Predicting Content of User Utterances in Dialogs
Predicting Content of User Utterances in Dialog
Svetlana Stoyanchev and Amanda Stent, SUNY, Stony Brook
Problem
Goal: build dialog systems that allow users to speak freely. Automatic speech recognition (ASR) is a big issue (typical ASR error rate in dialog is ~30%).
Approach
Two-pass ASR approach:
1. Predict presence of task-relevant concepts in user utterances using:
   1. lexical features recognized by the first pass of the ASR
   2. dialog history features
   3. prosodic features from the user's speech
2. Adapt language model to the predicted content
Result
We achieve statistically significant (but small) improvements in second-pass ASR accuracy for one dialog context; we plan to expand to others.
Today: performance of different methods of choosing lexical features
P-19
Syntactic Topic Models
Jordan Boyd-Graber and David M. Blei, Princeton University
[Figure: graphical model of the syntactic topic model, with variables α, αT, β, πk, τk, θd, αD, σ and plates ∞ and M]
Documents are collections of parse trees.
[Figure: example parse trees with latent classes z1 to z5 over the words START, slept, they, ran]
The latent class depends on the parent node and the document's topic distribution.
Princeton University Department of Computer Science
{jbg,blei}@princeton.edu
Both syntactic models and topic models are active, fruitful areas of research. One captures local patterns, and the other captures trends across many documents. To illustrate these different but complementary views, consider the following incomplete sentence from a travel brochure, "In a week, you could go to ___." A syntactic model such as the infinite tree with independent children [1] tells us what words could be an object of a preposition (e.g., "bed," "school," "debt"), and a topic model such as the hierarchical Dirichlet process (HDP) [5] could tell us what words fit with a travel theme ("vacation," "relax," "exotic," etc.). In this work, we develop a model that can combine the constraints of both syntax and semantics to build categories of words that are consistent with both.
To do this, we build a model called the syntactic topic model (STM). Using a corpus composed of dependency parse trees collected into documents, the STM learns "topics" that are both thematically and syntactically consistent. These topics, like the parts of speech in syntactic models or the syntactically-uninformed topics in topic models, are distributions over the lexicon.
To incorporate syntax and semantics, the STM combines the per-document distributions over topics (as in topic models) with the part of speech transition probabilities (as in syntactic models). It does this by taking the point-wise product of these distributions and then selecting a latent class for each word from this new, renormalized distribution, similar to the product of experts model [2]. More formally, the full generative model of the corpus is:
1. Choose global topic weights β ∼ GEM(α)
2. For each topic index k = {1, ...}:
   (a) Choose topic τ_k ∼ Dir(σ)
   (b) Choose topic transition distribution π_k ∼ DP(α_T, β)
3. For each document d = {1, ..., M}:
   (a) Choose topic weights θ_d ∼ DP(α_D, β)
   (b) For each sentence root node:
       i. Choose topic assignment z_0 ∝ θ_d π_start
       ii. Choose root word w_{d,0} ∼ mult(1, τ_{z_0})
   (c) For each additional word w_{d,n} and parent p_n, n ∈ {1, ..., d_n}:
       i. Choose topic assignment z_{d,n} ∝ θ_d π_{z_{p(d,n)}}
       ii. Choose word w_{d,n} ∼ mult(1, τ_{z_{d,n}})
Figure 1: Graphical model for a syntactic topic model (left); in greater detail is the graphical model for each sentence (right). Panels: (a) Overall Graphical Model; (b) Sentence Graphical Model.
Figure 2: On hand-parsed documents, the STM discovered two categories of topics. Some topics (shaded with grey) were shared across almost all documents and filled the role of a generic part of speech, not reflecting any thematic specification. Other topics, however, are selected by a document's semantic constraints. Example topics shown in the figure: "his, their, other, us, its, last, one, all"; "says, could, can, did, do, may, does, say"; "they, who, he, there, one, we, also, if"; "policy, gorbachev, mikhail, leader, soviet, restructuring, software"; "shares, quarter, market, sales, earnings, interest, months, yield".
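The topic-assignment step, a point-wise product of the document's topic weights and the parent's transition distribution followed by renormalization, can be sketched with finite vectors. The three-topic truncation, the numbers, and the names `combine`, `theta_d`, and `pi_parent` are illustrative assumptions. The actual model is nonparametric and fitted by variational inference.

```python
import numpy as np


def combine(doc_topic_weights, parent_transition):
    """Point-wise product of two distributions over topics, renormalized to sum to 1."""
    product = (np.asarray(doc_topic_weights, dtype=float)
               * np.asarray(parent_transition, dtype=float))
    return product / product.sum()


# Hypothetical three-topic truncation.
theta_d = np.array([0.7, 0.2, 0.1])    # document's topic weights
pi_parent = np.array([0.1, 0.6, 0.3])  # transition distribution of the parent's topic

p_z = combine(theta_d, pi_parent)

# Draw a latent class for the word from the combined distribution.
rng = np.random.default_rng(0)
z = int(rng.choice(len(p_z), p=p_z))
```

Topic 0 is favored by the document and topic 1 by the syntactic context. The product gives the most mass to topic 1, because a topic must be plausible under both distributions to score well.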
To discover the best configuration of these unobserved variables in our generative process we use variational inference for nonparametric Bayesian models [3]. This process uncovers the best top-level weights, topic transitions, per-document topic distributions, topic assignments, and topics.
We fit the STM to the Penn Treebank [4]. Instead of grouping all nouns into a single topic, some parts of speech (such as nouns and adjectives) are divided into specialized syntactic groups that appear in similar documents (Figure 2), but other parts of speech such as verbs and prepositions are shared across many documents. Quantitatively, the STM also did better in predicting words on held-out data; its perplexity on held-out documents was better (lower) than the HDP or the infinite tree.
References
[1] J. R. Finkel, T. Grenager, and C. D. Manning. The infinite tree. In ACL, pages 272–279, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[2] G. Hinton. Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks, pages 1–6, Edinburgh, Scotland, 1999. IEEE.
[3] P. Liang, S. Petrov, M. Jordan, and D. Klein. The infinite PCFG using hierarchical Dirichlet processes. In HLT, pages 688–697, 2007.
[4] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330, 1993.
[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, December 2006.
Learned topics are consistent with both syntax and theme.
P-20
Is Semantics Just Picking the Right Syntax for the Context from Multiple Possibilities?
Breck Baldwin, Alias-i
P-21
Semantic Hierarchies: Induction, Measurement, and Management
Cliff Joslyn, Michelle Gregory, Liam McGrath, Patrick Paulson, and Karin Verspoor
Pacific Northwest National Laboratory, University of Colorado Denver
Semantic hierarchies:
• Cores of ontologies
• 80-90% of links in real-world ontologies
• Becoming large: 10^4 to 10^6 nodes
• WordNet, Gene Ontology
Need for algorithms and measures
• Induction from text, relational data
• Visualization, annotation
• Alignment, matching
Concept lattices: semantic hierarchies from relational data
• Capture implication relations dually between objects, attributes
• Unbiased, graphical, visual representation
Mathematical order theory
• Metrics: Distances and similarities based on semi-modular valuation functions
• Ranks: Structure of vertical levels
• Morphisms: Mappings and linkages
Issues for knowledge systems
• Enable robust use of multiple inheritance: beyond trees!
• Avoid risks of pure graph theory, path-counting methods
• Proper use of vertical levels
P-23
Syntactical Knowledge Usage to Reduce Arabic/English Stemming Errors
Eiman Tamah Al-Shammari, Kuwait University, George Mason University
P-24
Characteristics of Textual Information in Video Data from the Perspective of Natural Language Processing
Kimiaki Shirahama, Akihito Mizui, and Kuniaki Uehara, Kobe University
Purpose
Topic detection in videos using utterances obtained by an ASR method (ASR transcripts)
→ Efficient search and browsing of a video archive
Text document: All the semantic contents are conveyed only through a text medium.
Video: Semantic contents are conveyed through synchronized video and audio media in a complementary manner. Synergy between video and audio media.
Pattern of word occurrences
Preliminary examination of whether NLP methods can appropriately process ASR transcripts
• Trigger pair extraction → NLP methods cannot treat temporal distributions of spoken words.
• Topic extraction by LDA → The same words are commonly spoken in different words.
• The same word is not spoken so many times. → Burst detection based on character's appearance
Trigger pair….. President Kennedy had embarked on a tour of Texas in an effort to raise campaign funds and to unite party members. The President, accompanied by Vice-President Lyndon B. Johnson, Texas Governor John Connally, ….. The motorcade started a few minutes late but managed to proceed close to its schedule. The crowds were exuberant, encroaching on every vantage point along the route. ….. Incidents such as that, the clearing weather, the bright warm sun, and the tremendous and loudly cheering crowds were exactly what the president needed. ….. The Kennedy magic was at its best. Then, more than halfway along the route through Dallas, and just as the motorcade broke through the heaviest street crowds, ….. Shots echoed through Dealey Plaza. President Kennedy was mortally wounded, Governor Connally was seriously wounded. …..
Burst (along the time axis), example utterances:
"Oh my god!"
"President, please shake with me."
"Kennedy, it's fine today."
"Yeah. Sure. I'm happy many people come to this campaign at Dallas."
"Oh no! Jesus Christ!"
Discuss how to improve NLP methods for ASR transcript processing!
P-25
Biomedical Association Discovery via Complementary TDM
Kazuhiro Seki and Kuniaki Uehara, Kobe University
Text data mining (TDM)
• Explicit information: IR, IE, Classification, Summarization, etc.
• Implicit information: hypothesis discovery or literature-based discovery
[Diagram: Gene Ontology annotation and genetic association discovery. It links an article, genes (g1, g2, ..., gl), gene functions (f1, f2, f3, ..., fm-1, fm; GO categories CC, BP, MF), phenotypes (p1, p2, ..., pn), and a disease d, with positive and negative annotations, repeated for each gene]
P-28
Word Sense Disambiguation for Statistical Machine Translation
Marine Carpuat, Columbia University, Center for Computational Learning Systems
Most SMT systems do not explicitly use WSD
• static translation probabilities, not sensitive to context
But using WSD for SMT first gave confusing results
• WSD for SMT hurts BLEU score!? [Carpuat & Wu ACL-2005]
• But WSD should help SMT… [Carpuat & Wu IJCNLP-05]
Generalizing WSD to Phrase Sense Disambiguation for SMT [Carpuat & Wu, 2007]
• PSD is fully phrasal just like conventional SMT lexicons
• PSD predictions are fully integrated in SMT decoding
• PSD models are trained on the same parallel data as SMT lexicons
• PSD improves translation quality consistently on 8 metrics and 4 tasks
P-29
Crowdsourcing Annotations for Natural Language Tasks: An Evaluation
Rion Snow, Stanford University
• What would you do if you had an on-demand army of thousands of annotators?
• 10,000 labels / day
• 1,000 labels / dollar
• Expert-quality labeling or better (with some tricks)
• Results on five natural language tasks
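One simple way to combine several non-expert labels per item is a majority vote. The slide does not spell out its "tricks", so plain majority voting is an assumption here, and the function name and sample labels are illustrative.

```python
from collections import Counter


def majority_vote(labels_by_item):
    """Map each item to its most frequent label.

    Ties go to the label that was seen first for that item.
    """
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in labels_by_item.items()
    }


# Hypothetical redundant annotations from several workers per sentence.
annotations = {
    "s1": ["pos", "pos", "neg"],
    "s2": ["neg", "neg", "pos", "neg", "pos"],
}
consensus = majority_vote(annotations)
```

Collecting an odd number of labels per item avoids ties for binary tasks. Weighting annotators by their estimated reliability is a natural refinement.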
Cheap and Fast - But is it Good? Snow et al., EMNLP-2008
Next...
P-30 Bootstrapping Extraction Patterns from Wikipedia
Delip Rao, JHU
On Deck: P-31 James Mayfield
Double Deck: D-1 Daniel Tunkelang
P-30
Bootstrapping Extraction Patterns from Wikipedia
Delip Rao, JHU
Next...
P-31 Knowledge Base Evaluation for Semantic Knowledge Discovery
James Mayfield, Bonnie Dorr, Tim Finin, Douglas Oard and Christine Piatko
Human Language Technology Center of Excellence
On Deck: D-1 Daniel Tunkelang
Double Deck: D-3 David Nadeau
P-31
Mayfield, Dorr, Finin, Oard, Piatko NYU Symposium on Semantic Knowledge Discovery, Organization and Use
Knowledge Base Evaluation for Semantic Knowledge Discovery
• Key idea: evaluate knowledge base, not extraction output
• Six evaluation axes
– Accuracy
– Usefulness
– Augmentation
– Explanation
– Adaptation
– Temporal Qualification
• This approach has many advantages!
[Figure: a KB of Structured Knowledge (Entities, Events, Relations) with "Evaluate" labels; the name Ali Hassan al-Majid / علي حسن عبد المجيد التكريتي appears as a sample mention. Example KB entries:]
PERSON: Ali Hassan al-Majid (علي حسن عبد المجيد التكريتي); DOB: 1941; Citizenship: Iraq; Position: Defense Minister
ORGANIZATION: Jihaz al-Mukhabarat al-Amma; AKA: Jihaz al-Khas; Country: Iraq
Next...
D-1 Unsupervised Annotation and Exploratory Search
Daniel Tunkelang, Endeca
On Deck: D-3 David Nadeau
Double Deck: D-4 Gregory Marton
D-1
Next...
D-3 Demo of Semi-Supervised Named Entity Recognition at OpenPlaces
David Nadeau, Openplaces
On Deck: D-4 Gregory Marton
Double Deck: D-5 Mona Diab
D-3
openplaces™
Semi-supervised Named Entity Recognition
- Minimal human input
- Web page wrapper induction
- Travel ontology
- 1 trillion 'relations'
Semantic Search Engine for the Travel domain
Next...
D-4 Procedure Discovery for Time Expression Understanding
Gregory Marton, MIT
On Deck: D-5 Mona Diab
Double Deck: D-6 Michael Paul
D-4
Procedure Discovery for Time Expression Understanding
Gregory Marton [email protected]
Existing Lexicon
"tomorrow" : (λ.t (.add t 1 'day))
"May Day" : (λ.t (.near t #:month 5 #:day 1))
"Thursday" : (λ.t (.near t #:day-of-week 4))
...
Unseen Word
"Thrusday" "Earth Day"
Distributionally Similar Words
"Thursday" "Thruway" "Tuesday" ...
"World AIDS Day" "Veterans Day" "May Day" "Thanksgiving" ...
Source Semantics
(λ.t (.near t #:month 5 #:day 1))
(λ.t (.near t #:month 11 #:day 11))
(λ.t (.near t #:month 11 #:day-of-week 4 #:nth 4))
Learned Semantics
"Thrusday" : (λ.t (.near t #:day-of-week 4))
"Earth Day" : (λ.t (.near t #:month 4 #:day 22))   VAL="2003-04-22"
Unseen Meaning
"9/11"   VAL="2001-09-11"
"9/11" : (λ.t (.set-value t "2001-09-11"))
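The lexicon entries on the D-4 slide can be read as small procedures over a reference time t. The Python sketch below is one possible reading, not the author's implementation. It assumes that `.near` means "the closest date to t matching the given fields", that day-of-week 4 is Thursday (ISO numbering), and that learning an unseen word means copying the shape of a similar word's procedure and refitting its fields to an annotated VAL. The `#:nth` field is omitted, and all function names are hypothetical.

```python
from datetime import date, timedelta


def near(t, month=None, day=None, day_of_week=None):
    """Closest date to t, searching both directions, that matches every given field."""
    def matches(d):
        return ((month is None or d.month == month)
                and (day is None or d.day == day)
                and (day_of_week is None or d.isoweekday() == day_of_week))
    for offset in range(1500):  # a bit over four years covers leap-day cases
        for candidate in (t + timedelta(days=offset), t - timedelta(days=offset)):
            if matches(candidate):
                return candidate
    raise ValueError("no matching date found")


# Entries are (operation, arguments) pairs so that arguments can be refit later.
LEXICON = {
    "tomorrow": ("add_days", {"n": 1}),
    "May Day": ("near", {"month": 5, "day": 1}),
    "Thursday": ("near", {"day_of_week": 4}),
}


def interpret(phrase, t):
    """Evaluate a lexicon entry against reference date t."""
    op, args = LEXICON[phrase]
    if op == "add_days":
        return t + timedelta(days=args["n"])
    if op == "near":
        return near(t, **args)
    if op == "set_value":
        return date.fromisoformat(args["value"])
    raise ValueError(op)


def learn_from_neighbor(unseen, neighbor, observed):
    """Copy the neighbor's 'near' fields and refit them to an observed VAL date."""
    op, args = LEXICON[neighbor]
    if op != "near":
        raise ValueError("this sketch only refits 'near' procedures")
    fitted = {"month": observed.month, "day": observed.day,
              "day_of_week": observed.isoweekday()}
    LEXICON[unseen] = (op, {field: fitted[field] for field in args})


# "Earth Day" borrows the month/day shape of "May Day", refit to VAL="2003-04-22".
learn_from_neighbor("Earth Day", "May Day", date(2003, 4, 22))
# An unseen meaning with a fixed value, as in the "9/11" example.
LEXICON["9/11"] = ("set_value", {"value": "2001-09-11"})
```

Storing each entry as data instead of as a closure is a design choice of this sketch: it keeps the fields visible so a learned entry can reuse a neighbor's structure with new values.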
Next...
D-5 SALAMCAT: Sense Assignment Leveraging Alignments, Monolingual Contexts And Translations
Mona Diab and Weiwei Guo, Columbia University
On Deck: D-6 Michael Paul
Double Deck: D-7 Emily Jamison
D-5
SALAMCAT: Sense Assignment Leveraging Alignments, Monolingual Contexts And Translations
Mona Diab and Weiwei Guo, Columbia University
Next...
D-6 AIRTA: An Automatic Inter-disciplinary Research Topic Advisor - Where are We and Where do We Go -
Michael Paul and Roxana Girju, University of Illinois at Urbana-Champaign
On Deck: D-7 Emily Jamison
Double Deck: D-8 Toru Hirano
D-6
Michael Paul† and Roxana Girju‡
Departments of Computer Science († ‡) and Linguistics (‡), Beckman Institute († ‡)
University of Illinois at Urbana-Champaign
{mjpaul2, girju}@illinois.edu
Introduction
We believe that, like other disciplines, computational linguistics will benefit greatly from an interdisciplinary perspective.
Our tool is designed to foster interdisciplinary research in order to make breakthrough predictions for future directions.
This is accomplished by analysing trends within and across relevant fields and then automatically suggesting new research directions and topics.
[Figure caption: Some fields motivating research in computational linguistics]
Trends/Analysis
Because our data is categorized and labelled by year, we can see how research in certain fields rises and declines over time.
We can use this information to gauge which topics are important and which areas are saturated.
We also look for correlations in trends in similar fields across different disciplines.
Next Step
The next phase of this project (the final goal) will be to generate new topics. The key is to discover topics that are important in one discipline but have been studied little in another. These suggestions will be useful to professionals who would like to engage in research discussions with other parties but who are not familiar with those areas. They will also be beneficial to students looking for novel research topics.
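The idea of finding topics that are important in one discipline but little studied in another can be sketched as a comparison of per-discipline topic shares. This is only an illustration under stated assumptions: the thresholds, the function names, and the toy counts are hypothetical and do not come from the AIRTA system.

```python
def topic_shares(counts):
    """Convert per-topic paper counts into fractions of the discipline's total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in counts.items()}


def suggest_topics(source_counts, target_counts,
                   min_source_share=0.10, max_target_share=0.05):
    """Topics prominent in the source discipline but rare in the target one,
    ordered by the size of the gap between the two shares."""
    source = topic_shares(source_counts)
    target = topic_shares(target_counts)
    candidates = [
        (topic, share - target.get(topic, 0.0))
        for topic, share in source.items()
        if share >= min_source_share and target.get(topic, 0.0) <= max_target_share
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [topic for topic, _gap in candidates]


# Toy counts, invented for illustration only.
source_discipline = {"topic_a": 50, "topic_b": 30, "topic_c": 20}
target_discipline = {"topic_a": 1, "topic_b": 60, "topic_c": 39}
suggestions = suggest_topics(source_discipline, target_discipline)
```

With these toy counts only topic_a qualifies: it makes up half of the source discipline's papers but one percent of the target's.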
Back End
We currently have a database with:
• 4,700 papers from computational linguistics conferences
• 2,300 papers from linguistics journals
• 1,700 papers from education/educational psychology journals
We will enlarge our corpus as we continue to work on this project.
Classification
We categorized these papers mostly using Latent Dirichlet Allocation (LDA) with words from titles, abstracts, and full text when available.
AIRTA: An Automatic Interdisciplinary Research Topic Advisor - Where are We and Where do We Go -
[Figure labels: Industry, Linguistics, Machine Learning, Cognitive Psychology, Education]
A sample of categories and the top keywords associated with them:
Lexical Semantics: lexical, entries, semantic, word, idioms, words, lexicon
Morphology: morphological, word, morphology, lexical, level, forms
MT Evaluation: evaluation, score, human, scores, sentence, automatic
Named Entities: entity, names, named, entities, ne, information, person
Multimodal NLP: multimodal, speech, gesture, user, language, input
Each dot represents a paper in the “Dialogue Systems” category. The coloring shows how papers can span multiple categories.
Language-related topics comprise the bulk of research in education and are steadily increasing in prominence.
Next...
D-7 CACTUS: A User-friendly Toolkit for Semantic Categorization and Clustering in the Open Domain
Emily Jamison, The Ohio State University
On Deck: D-8 Toru Hirano
D-7
CACTUS: A User-friendly Toolkit for Semantic Categorization and Clustering in the Open Domain
Emily K. Jamison
CACTUS: A 1-Slide Introduction
• Open-domain
• No Training Required
• Easy-to-use GUI
• Or, command-line interface
• Near-universal coverage
• Internet as Knowledge Source
Next...
D-8 Aggregating Knowledge of Named Entity Relations
Toru Hirano, Yoshihiro Matsuo, and Genichiro Kikui, NTT Cyber Space Laboratories
D-8
D-8: Aggregating Knowledge of Named Entity Relations
[Figure: labeled components are Web, Extractor, Wikipedia, Geographic database, and Relational Database. Example input: “George Bush is the President of the U.S.” Example extracted triple: [ President, George Bush, the U.S. ]]
NE1:String    NE1:ID               Relationship   NE2:String   NE2:ID
George Bush   George W. Bush-001   President      the U.S.     United States of America-001
Bush          George W. Bush-001   Speech         NY           New York City-010
8 million records from 14 million web pages
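The NE1:String / NE1:ID and NE2:String / NE2:ID columns suggest that surface strings are mapped to canonical entity IDs so that relations about the same entity can be aggregated. The sketch below illustrates that step under stated assumptions: the alias table is hand-written from the slide's examples, while the lookup strategy and function name are hypothetical and not the authors' method.

```python
from collections import defaultdict

# Hypothetical alias table built from the examples shown on the slide.
ALIASES = {
    "George Bush": "George W. Bush-001",
    "Bush": "George W. Bush-001",
    "the U.S.": "United States of America-001",
    "NY": "New York City-010",
}


def aggregate_relations(triples):
    """Group (relationship, NE1 string, NE2 string) triples by canonical NE1 ID.

    Triples whose entity strings cannot be resolved are skipped.
    """
    by_entity = defaultdict(list)
    for relationship, ne1, ne2 in triples:
        ne1_id = ALIASES.get(ne1)
        ne2_id = ALIASES.get(ne2)
        if ne1_id is None or ne2_id is None:
            continue
        by_entity[ne1_id].append((relationship, ne2_id))
    return dict(by_entity)


extracted = [
    ("President", "George Bush", "the U.S."),
    ("Speech", "Bush", "NY"),
    ("President", "Unknown Person", "the U.S."),  # unresolved, so it is dropped
]
knowledge = aggregate_relations(extracted)
```

Both resolvable triples end up under the single ID George W. Bush-001, even though the source text used two different strings for the same person.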