Statistical Methods in Computational Linguistics
1. Course Overview
Jonas Kuhn
Universität Potsdam, 2007
Outline
Course Overview & Introduction
Some Python Programming
Course Overview
Simple Python Programming
Basic Probability Theory
N-Gram Language Modeling
Basic Information Theory: Entropy
Data Sparseness & Smoothing Techniques
Machine Learning Paradigms
Part-of-Speech Tagging with Statistical and ML Techniques
Probabilistic Grammars & Parsing
Statistical Machine Translation
The Status of Statistical Methods
Eric Brill and Raymond J. Mooney (1997):
An Overview of Empirical Natural Language Processing
In: AI Magazine, 18(4): Winter 1997, 13-24.
The linguistic knowledge-acquisition problem
Rationalist methods
Empirical or corpus-based methods
History of NLP
1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner)
Mid-1950s: Chomsky’s program
  Observational and explanatory adequacy
  Arguments against learnability of language from data; innateness hypothesis
Rationalist methods in AI research in NLP
  Hand coding of rules
Starting in early 1980s
  Some work on induction of lexical and syntactic information from text
  Empirical methods in speech recognition (hidden Markov models; HMMs)
History of NLP
Late 1980s/1990s: Statistical techniques in various areas of NLP
  POS tagging
  Machine translation
  Probabilistic context-free grammars
  Word sense disambiguation
  Anaphora resolution
Reasons for the Resurgence of Empiricism
Empirical methods offer potential solutions to several related, long-standing problems in NLP:
(1) Acquisition, automatically identifying and coding all the necessary knowledge
(2) Coverage, accounting for all the phenomena in a given domain or application
(3) Robustness, accommodating real data that contain noise and aspects not accounted for by the underlying model
(4) Extensibility, easily extending or porting a system to a new set of data or a new task or domain
Reasons for the Resurgence of Empiricism
Additional factors:
(1) Computing resources, the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
(2) Data resources, the development and availability of large corpora of linguistic and lexical data for training and testing systems
(3) Emphasis on applications and evaluation, industrial and government focus on the development of practical systems that are experimentally evaluated on real data
Categories of Empirical Methods (1)
Probabilistic methods
Symbolic learning methods
Neural network/connectionist methods
Categories of Empirical Methods (2)
Different dimension: type of training data
Supervised learning
Annotated text
Unsupervised learning
Indirect feedback
Important: combination of rationalist and empirical methods
An Interdisciplinary Field
[Figure: diagram of disciplines and topic areas related to Statistical NLP. Labels in the diagram:]
Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Computational Linguistics, Philosophy, Empirical Sciences, Statistics, Psycholinguistics
Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics
Statistical NLP
Practical Aspects
We will use Python for small programming exercises: http://www.python.org/
NLTK library (in Python), the Natural Language Toolkit: http://nltk.sourceforge.net/
(probably) WEKA for small Machine Learning experiments: http://www.cs.waikato.ac.nz/ml/weka/
Python
Tutorial introduction in an NLP context: http://nltk.sourceforge.net/docs.html
Chapter 2: Programming
Python: Key Features
Simple yet powerful, shallow learning curve
Object-oriented: encapsulation, re-use
Scripting language, facilitates interactive exploration
Excellent functionality for processing linguistic data
Extensive standard library, incl. graphics, web, numerical processing
Can be downloaded for free from http://www.python.org/
Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
Python example
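import sys

# Read standard input and print every word that ends in "ing"
for line in sys.stdin.readlines():
    # split() breaks the line into whitespace-separated words
    for word in line.split():
        if word.endswith('ing'):
            print(word)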
1. whitespace: nesting lines of code; scope
2. object-oriented: attributes, methods (e.g. line)
3. readable
Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
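The loop above can also be wrapped in a function so that it can be tried out interactively on a few strings instead of standard input. This is only a minimal sketch: the function name `words_ending_in`, its `suffix` parameter, and the sample sentences are assumptions made for illustration and do not come from the original slide.

```python
def words_ending_in(lines, suffix="ing"):
    """Collect every whitespace-separated word that ends with `suffix`."""
    matches = []
    for line in lines:
        # split() and endswith() are methods of the string objects
        for word in line.split():
            if word.endswith(suffix):
                matches.append(word)
    return matches


# Hypothetical sample input, used only for illustration
sample = ["I was singing and dancing", "nothing to see here"]
print(words_ending_in(sample))  # prints ['singing', 'dancing', 'nothing']
```

Note that "nothing" is matched as well: the test looks only at the final characters of each word, not at its morphology.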
Comparison with Perl
while (<>) {
    foreach my $word (split) {
        if ($word =~ /ing$/) {
            print "$word\n";
        }
    }
}
1. syntax is obscure: what are: <> $ my split ?
2. “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)
3. large programs difficult to maintain, reuse
Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
What NLTK adds to Python
NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
Basic classes for representing data relevant to natural language processing
Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
Standard implementations for each task, which can be combined to solve complex problems
Extensive documentation, including tutorials and reference documentation
Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
Installing Python and NLTK
1. Install Python, Numeric
2. Install NLTK-Lite, NLTK-Lite-Corpora
3. Set environment variable NLTK_LITE_CORPORA
For detailed instructions, see: http://nltk.sourceforge.net/install.html
Running Project Idea
Language Identification
In what language is a given text document?
First ideas?
(Using simple text processing techniques)
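One possible first idea, given here only as a hedged sketch and not as the approach the course settles on, is to count how many tokens of a text appear in a short list of very frequent function words for each candidate language. The word lists, the name `guess_language`, and the choice of three languages are all assumptions made for illustration; realistic lists would be derived from corpora.

```python
from collections import Counter

# Tiny hand-picked function-word lists (illustrative assumption only)
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def guess_language(text):
    """Return the language whose word list matches the most tokens, or None."""
    # Lowercase, split on whitespace, and strip common punctuation marks
    tokens = [token.strip(".,;:!?\"'()") for token in text.lower().split()]
    scores = Counter()
    for language, stopwords in STOPWORDS.items():
        scores[language] = sum(1 for token in tokens if token in stopwords)
    best_language, best_score = scores.most_common(1)[0]
    # No matching token at all: make no guess
    return best_language if best_score > 0 else None


print(guess_language("Der Hund ist nicht in dem Haus und das ist gut."))
print(guess_language("The cat is in the house."))
```

Under these assumptions the first call yields "german" and the second "english". The sketch is easy to fool with very short texts, with words shared between languages, or with ties between scores, which is where statistical models over words or characters become attractive.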