Twenty Years of Language Resource Development and Distribution: A Progress Report on LDC Activities
Christopher Cieri, Marian Reed, Denise DiPersio, Mark Liberman
University of Pennsylvania, Linguistic Data Consortium
{ccieri, mreed, dipersio, myl} AT ldc.upenn.edu
Models
[Timeline: membership years 1993-2012]
20 Membership Years, currently running January-December
In any single membership year, LDC releases 30-36 corpora
(in early years, range was 14-50 corpora/year)
2 basic membership types: Standard & Subscription
3 member types: Non-Profit, Government, Commercial
Non-Profit Organization
Standard Membership
Fee: $2400 to help sustain the Consortium
Select any 16 data sets
Research Use
Ongoing Rights
Non-Profit Organization
Subscription Membership
Fee: $3850 to help sustain the Consortium
All data sets * 2 copies shipped automatically
Research Use
Ongoing Rights
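To compare the two Non-Profit membership types above, here is a minimal sketch of the per-data-set cost. The fees ($2400, $3850), the 16-data-set allowance and the 30-36 corpora released per year come from the slides. The function name, and the assumption that a Subscription member receives every corpus released that year, are illustrative only.

```python
def cost_per_data_set(fee, data_sets):
    """Return the membership fee divided by the number of data sets received."""
    if data_sets <= 0:
        raise ValueError("data_sets must be positive")
    return fee / data_sets


# Figures quoted on the slides.
STANDARD_FEE = 2400          # Non-Profit Standard Membership
SUBSCRIPTION_FEE = 3850      # Non-Profit Subscription Membership
STANDARD_DATA_SETS = 16      # "Select any 16 data sets"
YEARLY_RELEASES = (30, 36)   # "LDC releases 30-36 corpora" per membership year

standard = cost_per_data_set(STANDARD_FEE, STANDARD_DATA_SETS)
# Assumption: a Subscription member receives all corpora released that year.
subscription = [cost_per_data_set(SUBSCRIPTION_FEE, n) for n in YEARLY_RELEASES]

print(f"Standard:     ${standard:.2f} per data set")
print(f"Subscription: ${subscription[1]:.2f} to ${subscription[0]:.2f} per data set")
```

Under these assumptions the Subscription fee buys more data sets at a lower unit cost, while the Standard fee lets a member pick only the data sets it needs.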
Value
Members in 1996, 1997 paid $2000 fee for each year
Received 18 CALLHOME data sets which cost $5,000,000 to create
plus 36 other corpora including:
Switchboard-1 Release 2
The CMU Kids Corpus
1996 Speaker Recognition Benchmark
Boston University Radio Speech Corpus
ROI=1250% just on the CALLHOMEs
Development cost per single corpus: min=$42,000, max=$2,000,000
Lowest possible ROI=153%
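The CALLHOME figure above can be checked with a short calculation. This is a sketch under one stated assumption: the slide's "1250" is read as the plain ratio of creation cost to total fees paid over the two membership years (expressed as a percentage, the same ratio would be 125,000%). The function name is hypothetical, and the sketch does not attempt to reproduce the "lowest possible ROI" figure, whose inputs the slide does not spell out.

```python
def cost_to_fee_ratio(creation_cost, total_fees):
    """Return how many times the creation cost exceeds the fees paid."""
    if total_fees <= 0:
        raise ValueError("total_fees must be positive")
    return creation_cost / total_fees


# Figures quoted on the slide.
FEE_PER_YEAR = 2000          # fee paid in each of 1996 and 1997
MEMBERSHIP_YEARS = 2
CALLHOME_COST = 5_000_000    # cost to create the 18 CALLHOME data sets

total_fees = FEE_PER_YEAR * MEMBERSHIP_YEARS
ratio = cost_to_fee_ratio(CALLHOME_COST, total_fees)

print(f"Total fees paid: ${total_fees}")
print(f"CALLHOME creation cost is {ratio:.0f} times the fees paid")
```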
Alternatives
Government Membership: same fees as Non-Profit, different agreement
Commercial Membership: higher fees, commercial rights
Many corpora can be licensed individually but at greater unit cost.
Can’t even afford $2400?
Wait, how did you get here, then?
LDC is a Consortium, an organization of organizations,
established for their mutual benefit.
LDC sometimes trades data sets for other Language Resources
or services.
Provided it offers the Consortium a positive return on their
investment
Let’s talk!
OK. Never mind.
Origin & Model
Linguistic Data Consortium: established 1992 via open, competitive DARPA solicitation, won by U. Penn.
centralize distribution, archiving of language data
manage licenses & distribution practice
Business Model: developed by overseers from government, industry and academia
DARPA funding covered operations, corpus creation for 5 years
required to be self-sufficient via annual membership fees, data licenses
new grants fund LR creation, not maintenance; NSF, NIST early supporters
Data Sources: donations, funded projects, community initiatives and LDC initiatives
Membership: members provide annual support, generally fees, sometimes data, services
receive ongoing rights to data published in years when they support LDC
reduced fees on older corpora, extra copies
access to LDC Online
Benefits
Uniform licensing within & across research communities
4 basic user license types, 1000s of instances
~100 provider arrangements
no significant copyright issues in 20 years of operations
several independent issues resolved
Cost Sharing
relieves funding agencies of distribution costs, concerns
provides vast amounts of data to members
LDC annual membership benefit: ~30 corpora
development cost for 1 corpus ≥ (LDC membership fee * 10 | 100 | 1000)
Stable research infrastructure
LRs permanently accessible, across multiple platform changes
terms of use & distribution methods standardized & simple
members’ access to data is ongoing
any patches available via same methods
tools, specifications, papers distributed without fee
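The cost-sharing claim above (development cost for one corpus is at least 10, 100 or 1000 times the membership fee) can be illustrated with figures quoted elsewhere in this deck: a $2400 Non-Profit Standard fee and per-corpus development costs between $42,000 and $2,000,000. Pairing those numbers is an assumption made here for illustration, and the function name is hypothetical.

```python
import math


def fee_multiple(development_cost, membership_fee):
    """Return the development cost expressed as a multiple of the fee."""
    if membership_fee <= 0:
        raise ValueError("membership_fee must be positive")
    return development_cost / membership_fee


# Figures quoted elsewhere in the slides.
MEMBERSHIP_FEE = 2400
MIN_CORPUS_COST = 42_000
MAX_CORPUS_COST = 2_000_000

low = fee_multiple(MIN_CORPUS_COST, MEMBERSHIP_FEE)
high = fee_multiple(MAX_CORPUS_COST, MEMBERSHIP_FEE)

# Order of magnitude of each multiple (1 means tens, 2 means hundreds).
orders = (math.floor(math.log10(low)), math.floor(math.log10(high)))

print(f"Cheapest corpus:       {low:.1f} x the annual fee")
print(f"Most expensive corpus: {high:.1f} x the annual fee")
```

With these inputs the multiples fall in the tens to hundreds; other fee and cost pairings would give different multiples.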
Models
[Diagram: Providers and Members, with a Data Center acting as IPR Intermediary between them]
IPR intermediary improves combinatorics
Providers + Members
NOT Providers * Members
But much more
speaks for a group of 3168 organizations
greater experience negotiating IPR than any single member
dedicated staff, trained and experienced
linguists, computer scientists can focus on what they do best
works with researchers who have high-value contacts
spreads cost of any data acquisition over user base
consistent, attractive terms to providers and users
peace-of-mind for providers
clarity for members
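The combinatorics point on this slide (Providers + Members, not Providers * Members) can be sketched as a count of agreements. Without an intermediary, every provider needs an agreement with every member. With one, each party needs a single agreement with the intermediary. The function names are hypothetical, and the second example pairs the deck's "~100 provider arrangements" with its "3168 organizations" purely as an illustration of scale.

```python
def direct_agreements(providers, members):
    """Agreements needed when every provider licenses to every member directly."""
    return providers * members


def intermediary_agreements(providers, members):
    """Agreements needed when each party deals only with the intermediary."""
    return providers + members


# (providers, members): the 10 x 10 diagram, then an illustrative larger case.
for providers, members in [(10, 10), (100, 3168)]:
    direct = direct_agreements(providers, members)
    via_center = intermediary_agreements(providers, members)
    print(f"{providers} providers, {members} members: "
          f"{direct} direct vs {via_center} via intermediary")
```

The gap grows with the size of the community: the direct count is multiplicative, the intermediary count only additive.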
Benchmarking
Since inception in 1992, LDC has distributed
>84000 copies
>1300 titles
>3168 organizations
>70 countries
About half of the titles are e-corpora
developed for technology evaluation programs
released generally after use in the relevant communities
64 titles added to Catalog since last LREC
>4 years of publications “in queue”!!!
8309 academic papers relying on LDC Corpora
search for such papers is ~ 60% complete
LDC Roles
distribution & archiving
language resource production, including quality control
intellectual property rights and license management
human subject protocol management
data collection
annotation and lexicon building
creation of tools, specifications, best practices
knowledge transfer: documentation, metadata, consulting, training
corpus creation research (meta-research) and academic publication
resource coordination in large multisite programs
serving multiple research communities
as funding panelists, workshop participants and oversight committee members
LDC Structure
Mark Liberman, Director
Christopher Cieri, Executive Director
Christopher Walker, Manager, Software Development
Mohamed Maamouri, Sr. Research Administrator
Andrea Mazzucchi, Manager Systems
Stephanie Strassel, Sr. Assoc Dir Collection, Annotation
Denise DiPersio, Assoc Dir, External Relations
Mark Mandel, Researcher
Natalia Bragilevskaya, Manager LDC/IRCS RBO
Karina Czoka, Fiscal Coordinator
Ikeila Turner, Office Manager
Daniel Jacquette, Publications Manager
Marian Reed, Communications Manager
Name, Webmaster
Eleftheria Ahtaridis, Membership Manager
Andrew McMackin, SysAdmin
Wayne Hill, SysAdmin
Miguel Reynoso, Help Desk Coordinator
Yiwola Awoyale, Researcher
Moussa Bamba, Researcher
Seth Kulick, Researcher
Ann Bies, Res. Project Manager
Justin Mott, Lead Annotator
Dave Graff, Internal Consultant
Jon Wright, Lead Programmer
Kevin Walker, Lead Programmer
Haejoong Lee, Lead Programmer
John Malamon, Sr. Programmer/Analyst
Preston Cabe, Sr. Programmer/Analyst
Chris Caruso, Programmer/Analyst
Brendan Callahan, Programmer/Analyst
Will Haun, Programmer/Analyst
Robert Parker, Sr. Programmer/Analyst
Brian Gainor, Programmer/Analyst
Ann Sawyer, Research Project Coord
Lauren Summers, Research Project Coord
Xiaoyi Ma, Lead Programmer
Steve Grimes, Sr. Programmer/Analyst
Mike Ciul, Programmer/Analyst
Programmer/Analyst
Programmer/Analyst
Zhiyi Song, Sr. Project Manager
Kira Griffit, Res. Project Manager
Amanda Morris, Res. Project Manager
Kari van der Clouet, Sr. Project Manager
Xuansong Li, Research Administrator
Safa Ismael, Research Project Coord
Ramez Zakhary, Coord/Lead Annotator
Dalal Zakhary, Coord/Lead Annotator
Alonso Indacochea, Coord/Lead Annotator
Joe Ellis, Res. Project Manager
Jennifer Garland, Research Project Coord
Linda Brandschain, Res. Project Manager
Karen Jones, Research Project Coord
Stephanie Sessa, Coord/Lead Annotator
Neville Ryant, Research Programmer
Data Manager
Grants in Data
Significant concern & policy about an ill-understood phenomenon
a crypto-zoology of HLT researchers
LDC Principle: no one with a bona fide research agenda and a
genuine lack of ability to contribute will go without data
26 free data sets (click What’s Free on home page)
numerous, numerous arrangements to get data to needy researchers
Formalization:
grants in data each semester
requirements: data use statement, letter of support from advisor
Grants
2010: 8 corpora
2011: 24 corpora
2012: 8 corpora
CS, EE, oriental studies, second language acquisition and teaching
$40,000 awarded to date
Programs
[Data-type columns in the original table: NW, WB, BN/C, CTS, IV, Vid, OTHER]
Program             Goal
CALLHOME            STT
CALLFRIEND          LR
SWITCHBOARD         STT
Mixer               SR
LCTL                Translingual IR, MT
TDT                 STT, MT, IR
TIDES               STT, MT, IR, IE
EARS                STT
GALE                STT, MT, IR, IE, SUM
MADCAT              HR
MR                  QA
RATS                STT
BOLT                MT
BEST                SR
ALADDIN             Video ED
DOE Reading Enh.    Language Learning
DOE Dictionaries    Language Learning
LDC Online          Access
Net-DC              Networking
TalkBank            Networking
Bio-IE              IE
SCOTUS              Access, Diarization
Digging into Data   Mining
PNG/BOLD            Fieldwork
LDC Data in NIST Evaluations
96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11
LRE ✓ ✓ ✓ ✓ ✓ ✓
SRE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
BN Re ✓ ✓ ✓ ✓
CTS Re ✓ ✓ ✓ ✓
SDR ✓ ✓ ✓
TDT ✓ ✓ ✓ ✓ ✓ ✓ ✓
ACE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
OpenMT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DUC ✓ ✓ ✓ ✓ ✓ ✓ ✓
RT ✓ ✓ ✓ ✓ ✓ ✓ ✓
STD ✓
GALE Trans ✓ ✓ ✓ ✓ ✓ ✓
MetricsMaTr ✓ ✓
MADCAT ✓ ✓ ✓ ✓
TAC KBP ✓ ✓ ✓
TRECVid SED ✓ ✓ ✓ ✓
TRECVid MED ✓ ✓
Data Collection
news text, journals, financial documents
web text: newsgroups, blogs, discussion fora
email, chat, SMS, tweets
biomedical text & abstracts
printed, handwritten & hybrid documents
broadcast news & conversation, podcasts
conversational telephone speech
lectures, interviews, meetings, field interviews
read & prompted speech
task oriented speech, role play, speech in noise
web video
animal vocalizations
Adaptation: Annotation
data scouting, selection, triage
audio-audio alignment; bandwidth, signal quality, language, dialect, program, speaker
quick & careful transcription
segmentation & alignment at story, turn, sentence, word level
orthographic & phonetic script normalization
phonetic, dialect, sociolinguistic feature & supralexical
document zoning, handwriting transcription, OCR
tokenization and tagging of morphology, part-of-speech, gloss
syntactic, semantic, discourse function, disfluency, sense disambiguation
fine and coarse-grained topic, relevance, novelty, entailment
identification, classification of mentions in text of entities, relations, events, time, location & co-reference
knowledgebase population
single & multi-document summarization of various lengths from titles-200
translation, multiple translation, edit distance, translation post-editing, quality control
alignment of translated text at document, sentence, phrase & word levels
physics of gesture
identification, classification of entities and events in video
Program Services
assess program needs: sponsors, developers, evaluators
develop timelines for LR creation and system evaluation
translate “wish lists” into feasible action plans
coordinate LR activities across & among programs
maintain data matrices of LR features and availability
maintain optimization, stabilization of data requirements
incorporate technology into data production improving
rapidly catalog, license, replicate, distribute program LRs
broaden program impact through general distribution
protect restricted data
Cost Models
[Figure: cost models (Consortial, Early CT, DARPA, NSF) compared across Development, Internal Distribution, External Distribution and Maintenance; legend: User Pays, Sponsor Pays]
Conclusion
Data Centers must adapt, maintain their role in LR sharing
Data Centers alone offer
dedicated labor force
specialized equipment
special training
needed to fulfill their mission:
lower barriers to LR access
simplify discovery
guarantee longevity
reduce cost
Conclusion
LDC is large & expanding (with increasing circumspection)
Much wheel reinvention in context of new initiatives
Offer services that allow HLT researchers to focus on HLT
Expanding in
volume
diversity
quality
languages
data types
annotations
services