Post on 05-Jan-2016
Korea Terminology Research Center for Language and Knowledge Engineering
Infrastructures for the Korean Language
Key-Sun Choi
Academic Society
• SIG-Korean Language Computing, under the Korea Information Science Society: 300 members
• Korea Information Society: linguistics-oriented
KIBS: Korea Information Base and Systems
Purpose:
• To improve Korean language processing technology
• To promote the Korean software industry
• In the planning phase (1993), targeted at Hangul word processors, machine translation and Korean linguistic research
1995 - 1997 (Phase 1): “word”
• Two-ministry joint project + industry
• Ministry of Science & Technology, Ministry of Culture
1998 - 2000 (Phase 2): “sentence”
• Ministry of Science & Technology only + industry
• Will be evaluated in October 2000
2001 - 2003 (Phase 3): “discourse” - not decided
http://kibs.kaist.ac.kr/
King Sejong Project
Purpose:
• To promote Korean language research on the linguistics side
• To prepare for language planning: for the unification of South and North Korea, and for international use of Korean
Sponsor: Ministry of Culture
Period: 1998 - 2007 (10 years)
Items: corpus, dictionary, internationalization, terminology, education, font, old Korean, old Chinese characters
http://www.sejong.or.kr/
KIBS: Architecture
[Figure: KIBS architecture diagram. Labels recovered from the slide:
Levels: Knowledge Source Level, Knowledge Level, Engine Module Level, Engine Level, Application Level
Knowledge resources: Basic DB, corpus, MRD, Knowledge extractor, Ontology, Common Knowledge, Domain Knowledge, Electronic Dictionary, Terminology DB, Master DB
Engine modules: MA1, MA2, TA1, TA2, PA1, PA2, WSD1, WSD2, DA1, DA2, RM1, RM2
Engines: MT engine, IR engine, Spell checker, Style checker, UI engine
Applications: Word processor, MT system, Information Retrieval System, Automatic Speech Translation
Users: End User, User (Dictionary), User (Programmer), User (Lexicographer)
Support systems: Quality Management System, Terminology -- System, Distributed Resource Management System, Tagging Support Tool]
KIBS: Introduction
Title of Project
• KIBS I: Integrated Korean Information Base
• KIBS II: On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base
Outline
• Term: 1994.12.4 ~ 2004.9.30 (10 years)
• Sponsor: Ministry of Science and Technology
• Staff: 50 person/year
The Goal of First step
• Standard Module Interface
• Corpus and Electronic Dictionary Development and Management System
• Korean Part-of-Speech Tagging System
• Korean Syntactic Tagging System
• Korean/English Alignment System

• Terminological Data Base Development and Management System
• Standard Korean Input/Output Environment
• Standardized Methodology for the Construction of a Balanced Corpus
• Part-Of-Speech Transfer Dictionary Rules and an Example Package

• Tree-Tagged Corpus
• Word-Level Narrative Speech Data Base
• Hand-written Hangul scripts of high frequency

The Standardization & the Specification for Korean Information Base
The Development of an Integrated Environment and Support Management System
The Construction of Korean Information Base
The Goal of Second step
• Terminology Entries
• Domain-specific Corpus for Terminology Building
• Sublanguage Analysis and Extraction of Terminology

• Development/Management System for Information Base
• Development of Integrated Management System for Distributed Resources

• Syntactic Information Base for Syntactic Analysis/Generation
• Semantic Information Base for Semantic Analysis/Generation
• Additional Information on Language and GUI for Developing Applications

Quality Management System for Language Information Processing
Terminology Dictionary and Development/Management System
Development/Management System of Electronic Dictionary for Sentence Analysis/Generation (100,000 entries)
Development Tools
• Korean Concordance Program (KCP)
• Compound Noun Browser
• Corpus Browser
• Corpus Browser by Category
• Automatic English-to-Korean Transliteration System (TLEK)
• KAIST Ontology Browser
• Korean Morphological Analyser
• Korean Tagger
• Korean Syntactic Analyser
• Editing Support Tools for Electronic Dictionary
Results & Distribution
Major Results
The first (KIBS I): 1997.6. ~ present (80 sites)
• Text corpus: 10 million word phrases
• POS-tagged corpus: 1 million word phrases
• Syntactic-structure-tagged corpus: 10 thousand sentences
• TDMS, speech DB samples, hand-written character DB samples
The second (KIBS II): 1998.12. ~ present (140 sites)
• Raw corpus: 10 million word phrases
• POS-tagged corpus: 200 thousand word phrases
The third (KIBS III): 2000 (pending)
• Proper nouns: 10 thousand entries
• Compound nouns: 20 thousand entries
• Verb sentence pattern dictionary: 3 thousand entries, ...
Plan to maintain and distribute ...
Integration of Electronic Dictionaries
Dictionaries: total 420K entries (estimated now)
• Machine Readable Dictionary (Hangul Society): 200K entries
• Compound Noun, Proper Noun Classification, Internal Semantic Structure: 50K entries
• Searched Compound Noun, Proper Noun: open
• Verb Subcategorization: 10K frames (K-J comparison)
• Thesaurus: Korean-Japanese-Chinese-English, 150K entries (quality not so good)
• Usage from corpus for each sense
• Functional words
Problems
• Sense classification standardization
• Character code: Korean, Japanese, Chinese, … (most important problem); now being transferred to Unicode
Open through web:
• Corpus KWIC for Korean and Japanese: http://morph.kaist.ac.kr/kcp/
• Korean morphological analysis service: http://morph.kaist.ac.kr/
• By email: send a text file and receive its POS tagging in reply
• Graphic editor/debugger for Korean morphology
• Project status: http://kibs.kaist.ac.kr/
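As an illustration of what a KWIC (keyword in context) concordance returns, here is a minimal sketch. It is not the KCP implementation: the function name `kwic`, the `width` parameter and the whitespace tokenization are assumptions made for the example, and real Korean text would first need morphological analysis.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Assumes `tokens` is an already tokenized text; `width` is the number
    of tokens of context kept on each side (a hypothetical parameter).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    text = "the corpus browser shows the corpus in context"
    for left, kw, right in kwic(text.split(), "corpus", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>20} [{kw}] {right}")
```

Aligning the keyword in a fixed column is the usual concordance layout, since it lets a reader scan the left and right contexts of many hits at once.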
KORTERM
Korea Terminology Center for Language and Knowledge Engineering
http://korterm.org/ (English)
http://korterm.or.kr/ (Korean)
http://eafterm.org/ (East Asian Terminology)
Goals of KORTERM
Through worldwide terminology collection, and its standardization and harmonization in local society, distribution, publication and application in language and knowledge engineering are promoted.
Through education and consultation on terminology R&D methodology for each subject field, high-quality, highly reliable terminology and its infrastructure and systems are achieved.
Center of Terminology and Knowledge Engineering
Phases and Subjects of KORTERM
Phase 1 (1998-2000): R&D Environment and Basic Data Collection
Integration of Working Terminology
• Terminology Collection (Basic S&T, Industry Standard, Economics)
• Electronic Terminology (Publication)
• R&D Environment (System Standardization)
• Terminology Theory and Education Infrastructure
Phase 2 (2001-2003): Value-Added Working System
Value-Added Terminology Integration
• Terminology Collection (Extended S&T)
• Extension & Maintenance (Industry Standards)
• High-Quality Terminology
• Application in Language Industry
• Verification for High-Reliability and Distribution
Phase 3 (2004-2007): Operation
Multi-lingual Terminology Integration
• Terminology Collection (Humanity and Social Science)
• Maintenance and Extension
• Large-Scale Knowledge Base for Terminology
• Terminology Education Curriculum Development
• Application Product Development
Phase 4 (2008 - ): Maintenance and Extension
Continuous Extension and Management
• Terminology Study Promotion
• Distribution of Terminology Information Base
• Continuous Terminology Extension and Management
R & D (1)
Basic Data (Corpus)
• Corpus for Each Subject Domain
Electronic Dictionary for Basic Vocabulary
• Everyday Vocabulary consists of General Vocabulary and Everyday Terminology
Internationalization of Korean Language
• South-North Korean Terminology Standardization, Korean Language Input Methods
Korean Language Engineering
• Standardized Term Use for Information Retrieval, Machine Translation and Document Classification
R & D (2)
Language Engineering
Information Retrieval:
• Effective Internet Information Creation and Information/Knowledge Acquisition
• Multi-lingualism
Machine Translation:
• Efficient Information Generation through Terminology and Vocabulary Collection and Standardization
Wordprocessor:
• High Productivity by Spelling Correction, Summarization and Efficient Use
R & D (3)
Language, Information and Terminology
Language Education:
• Technical Thinking and Technical Communication
• Terminology-based Education
Language Study:
• Domain-specific Language Study
Terminology Sponsors
Support from Government, Organization and Industry according to each specialty
• Ministry of Culture and Tourism (KORTERM Center Operation)
• Ministry of Science and Technology (R&D Fund)
• Ministry of Information and Telecommunication (R&D Fund)
• Ministry of Diplomacy and Trade
• Ministry of Industry and Resource
• Ministry of Education
• Korea Science and Technology Foundation (Event Support)
Task Configuration
[Figure: KORTERM task configuration diagram. Labels recovered from the slide: Terminology Base (Collection); Non-standards; International Term Standard; Terminology Standard; Standardization & Harmonization; Terminological Conceptual Space; Terminology Symbolization; Terminology Access Standard Channel; Grid Size Controller; Application-Specific Dictionary; Language Education Adaptable to Student; Language & Knowledge Product; Language Education Environment; Terminology Information Environment; R&D Environment; Application Use; R&D, Industry, Living, Communication]
Large-Scale Speech/Language/Image DB Construction and Evaluation
Supported by Ministry of Science and Technology
Two Year Project (1999.10-2001.10)
Goals
Final Goal: Speech/Language/Image Evaluation Standardization
Organization
• Working Group Organization
• Survey and Planning
Specification Standardization
Language
• IR Test Suite and Evaluation Model Recommendation
• MT Test Suite and Evaluation Model Recommendation
Speech
• Sentence-unit Speech DB
• Prosody for Speech Synthesis
Image
• Image Attribute Format
• Color-Lexical Entry
• MPEG7 Specification
Test Suite
Language
• IR/QA 90 queries/200K docs, MT 5,000 sentences
Speech
• Word-unit telephone speech DB: 100 token * 500
Image
• Image 300 kinds - Meta Data
Question-Answering IR Test Suites
Test Suites for IR/QA
Documents
• 207,067 records (370MB)
• Newspapers
Query Generation
• 90 queries (through analysis of 300 quiz queries)
• Queries for WH-questions and other various types of answers, for NLP problem solving
• Relevant document set that includes the answer, built by using four kinds of commercial IR systems with 16 kinds of methods
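The slide does not say how the outputs of the four IR systems and 16 methods were combined. One common approach in IR test collection building is pooling: take the top-ranked documents from every run and send the union to human judges. The sketch below assumes that approach; the function name, the `depth` cutoff and the run format are hypothetical.

```python
def pool_relevance_candidates(runs, depth=10):
    """Union of the top-`depth` document ids across runs, per query.

    `runs` is assumed to be a list of dicts mapping a query id to a
    ranked list of document ids, one dict per system/method combination.
    The result is the candidate set to be judged, not the judgments.
    """
    pool = {}
    for run in runs:
        for query_id, ranked_docs in run.items():
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool


if __name__ == "__main__":
    runs = [
        {"q1": ["d1", "d2", "d3"]},
        {"q1": ["d2", "d4"], "q2": ["d9"]},
    ]
    for query_id, docs in sorted(pool_relevance_candidates(runs, depth=2).items()):
        print(query_id, sorted(docs))
```

Pooling keeps the judging effort manageable for a collection of about 200K documents, at the cost of missing relevant documents that no run ranked highly.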
English-Korean MT Test Suites
Type Classification: About 300 Kinds
Test Sentences and Test Queries: 5,000 Records
• Extracted from textbooks and grammar books (1999-2000)
• Will be extracted from real usage such as the web and newspapers (2000-2001)
Evaluation by Yes/No Questions
Tested on 4 Commercial English-Korean MT Systems
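A yes/no evaluation of this kind can be summarized as a pass rate per sentence type. The following is a minimal sketch under that assumption; the slide gives no scoring formula, and the function name, the input format and the type labels are illustrative only.

```python
from collections import defaultdict


def score_by_type(judgments):
    """Pass rate per test-sentence type.

    `judgments` is assumed to be an iterable of (type_label, passed)
    pairs, where `passed` is the yes/no answer for one test sentence.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for type_label, ok in judgments:
        totals[type_label] += 1
        if ok:
            passed[type_label] += 1
    return {label: passed[label] / totals[label] for label in totals}


if __name__ == "__main__":
    sample = [
        ("relative clause", True),
        ("relative clause", False),
        ("passive", True),
    ]
    for label, rate in score_by_type(sample).items():
        print(f"{label}: {rate:.0%}")
```

Reporting per type rather than a single overall score shows which of the roughly 300 classified constructions a given MT system handles poorly.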
MT Evaluation Workbench
Image Meta Data Editor
Meta data Input Workbench by XML
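The slides do not show the XML schema used by the workbench. As a rough sketch of what an XML image metadata record could look like, the example below builds one with Python's standard library; the element names `image` and `attribute` and the sample values are assumptions, not the project's actual format.

```python
import xml.etree.ElementTree as ET


def build_image_record(image_id, attributes):
    """Build one metadata record as an XML element.

    The <image>/<attribute> layout is hypothetical; it only illustrates
    storing name/value metadata per image.
    """
    record = ET.Element("image", {"id": image_id})
    for name, value in attributes.items():
        field = ET.SubElement(record, "attribute", {"name": name})
        field.text = value
    return record


if __name__ == "__main__":
    record = build_image_record("img-001", {"color": "red", "subject": "flower"})
    print(ET.tostring(record, encoding="unicode"))
```

Keeping each attribute as a named element makes retrieval by metadata a matter of matching on attribute names and values.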
Image Retrieval by Meta data
http://korterm.kaist.ac.kr/ksurimal/