Date posted: 14-Dec-2014
Category: Science
Uploaded by: vladimir-kulyukin
Speech & NLP
Syntax & Parsing
Vladimir Kulyukin
Outline
Syntax & Parsing
Context-Free Grammars
Definition
Epsilon Productions
Useful & Useless Symbols
Chomsky Normal Form (CNF)
Cocke-Younger-Kasami Algorithm & CFL Membership Problem
Syntax & Parsing
Syntax in the NLP context refers to the study of sentence or text structure
Parsing is the process of assigning a parse tree to a string
A grammar is required to generate parse trees
Grammars for natural languages consist of syntactic categories and parts of speech
Context-Free Grammars
Context-Free Grammar (CFG): Definition
A CFG G is a 4-tuple G = (V, Σ, S, P), where
V is the nonterminal alphabet;
Σ is the terminal alphabet;
S ∈ V is the start symbol;
P is a finite set of productions; each production is of the form X → y, where X ∈ V and y is a string over V ∪ Σ.
A Sample NL CFG
S → NP VP
S → AUX NP VP
S → VP
NP → DET NOMINAL
NOMINAL → NOUN
NOMINAL → NOUN NOMINAL
NP → ProperNoun
VP → VERB
DET → that | this | a
NOUN → left
AUX → does
VERB → make
PREP → from | to | on
ProperNoun → USU
NOMINAL → NOMINAL PP
PP → PREP NP
Formal Context-Free Languages
Example 01
Claim: Let L = {a^n b^n | n ≥ 0}. Show that L is context-free.
Proof: Consider the following CFG: S → aSb | ε. By induction on n, show that a^n b^n is derived from S.
Example 02
Claim: Let L = {a^n b^3n | n ≥ 0}. Show that L is context-free.
Proof: Consider the following CFG: S → aSbbb | ε. By induction on n, show that a^n b^3n is derived from S.
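The induction in both examples can be checked mechanically for small n. The sketch below is an illustration, not part of the original slides. It assumes the grammars are S → aSb | ε and S → aSbbb | ε, and the helper name `derive` and its parameter `bs_per_a` are hypothetical. It applies the recursive production n times and then the ε-production.

```python
def derive(n, bs_per_a=1):
    """Derive a string from S -> a S b^k | epsilon, where k = bs_per_a.

    Applies the recursive production n times, then erases S with the
    epsilon-production. Returns the list of sentential forms.
    """
    tail = "b" * bs_per_a
    form = "S"
    steps = [form]
    for _ in range(n):
        # Rewrite the single occurrence of S with a S b^k.
        form = form.replace("S", "aS" + tail)
        steps.append(form)
    # Finish with S -> epsilon.
    steps.append(form.replace("S", ""))
    return steps


# Example 01 (assumed grammar S -> aSb | epsilon) yields a^n b^n.
for n in range(5):
    assert derive(n)[-1] == "a" * n + "b" * n

# Example 02 (assumed grammar S -> aSbbb | epsilon) yields a^n b^(3n).
for n in range(5):
    assert derive(n, bs_per_a=3)[-1] == "a" * n + "b" * (3 * n)

print(" => ".join(derive(2)))  # S => aSb => aaSbb => aabb
```

A check over a few values of n is not a proof, but it mirrors the induction step: one more application of the recursive production adds one a on the left and k b's on the right.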
Useful & Useless Symbols in CFGs
Useful & Useless Symbols
Let G = (V, T, P, S) be a CFG
A symbol X is useful if there is a derivation S ⇒* αXβ ⇒* w for some α, β in (V ∪ T)* and some w in T*
A symbol X is useless if there is no such derivation
Example: Useful & Useless Symbols
Suppose CFG G has the following productions:
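S → AB | a
A → a
A, B are useless symbols: B derives no terminal string, so S → AB can never be completed, and A occurs only in that production
S is a useful symbol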
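One common way to find the useful variables is a two-pass procedure: first keep only the variables that derive some terminal string, then keep only those still reachable from the start symbol. The sketch below assumes productions are encoded as (head, body) pairs with tuple bodies; that encoding and the name `useful_variables` are illustrative choices, not something the slides prescribe.

```python
def useful_variables(variables, productions, start):
    """Return the variables that occur in some derivation S =>* w, w in T*.

    variables: set of nonterminal symbols.
    productions: list of (head, body) pairs; body is a tuple of symbols.
    Any symbol not in `variables` is treated as a terminal.
    """
    def body_ok(body, generating):
        return all(sym not in variables or sym in generating for sym in body)

    # Pass 1: generating variables (those that derive a terminal string).
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            if head not in generating and body_ok(body, generating):
                generating.add(head)
                changed = True

    # Drop every production that mentions a non-generating variable.
    kept = [(h, b) for h, b in productions
            if h in generating and body_ok(b, generating)]

    # Pass 2: variables reachable from the start symbol via kept productions.
    reachable = {start} if start in generating else set()
    frontier = list(reachable)
    while frontier:
        current = frontier.pop()
        for head, body in kept:
            if head == current:
                for sym in body:
                    if sym in variables and sym not in reachable:
                        reachable.add(sym)
                        frontier.append(sym)
    return reachable


variables = {"S", "A", "B"}
productions = [("S", ("A", "B")), ("S", ("a",)), ("A", ("a",))]
print(useful_variables(variables, productions, "S"))  # {'S'}
```

The order of the two passes matters: A is generating, but once S → AB is dropped because of B, A is no longer reachable from S.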
Elimination of Useless Symbols & ε-Productions
Every non-empty context-free language that does not contain ε can be generated by a grammar with no useless symbols or ε-productions
Chomsky Normal Form (CNF)
A grammar G = (V, T, S, P) is said to be in Chomsky Normal Form (CNF) if each production in P has one of the following forms:
1) A → BC
2) A → a
where A, B, C are in V and a is in T
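The two allowed forms translate directly into a checker. This is a minimal sketch under the assumption that productions are stored as (head, body) pairs with tuple bodies; the function name `is_cnf` is hypothetical. The first grammar tested is the one used in the CYK walkthrough of these slides.

```python
def is_cnf(variables, terminals, productions):
    """Check that every production has the form A -> BC or A -> a."""
    for head, body in productions:
        if head not in variables:
            return False
        # Form 1: exactly two variables on the right-hand side.
        binary = len(body) == 2 and all(sym in variables for sym in body)
        # Form 2: exactly one terminal on the right-hand side.
        single_terminal = len(body) == 1 and body[0] in terminals
        if not (binary or single_terminal):
            return False
    return True


V = {"S", "A", "B", "C"}
T = {"a", "b"}
cnf_grammar = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]
# S -> aSb mixes terminals and variables, and S -> epsilon has an empty body.
not_cnf = [("S", ("a", "S", "b")), ("S", ())]

print(is_cnf(V, T, cnf_grammar))  # True
print(is_cnf(V, T, not_cnf))      # False
```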
CNF Theorem
Let G be a grammar with no useless symbols
and no ε-productions. There is a CNF grammar
G’ such that L(G) = L(G’).
Cocke-Younger-Kasami (CYK) Algorithm
CYK Algorithm’s Problem
Problem: Given a CFG G = (V, T, P, S) and a string x in T*, determine if x is in L(G)
The Cocke-Younger-Kasami (CYK) algorithm takes a CFG in CNF and a string x and determines if S is one of the symbols that derive x
Substring Notation x_{s,l}
Let x be a string such that |x| = n ≥ 1
Let x_{s,l} be the substring of x of length l that starts at position s, 1 ≤ s ≤ n and 1 ≤ l ≤ n
For example, if x = aabbabb, then x_{1,3} = aab = x[1]x[2]x[3] and x_{2,4} = abba = x[2]x[3]x[4]x[5]
In general, if we do 1-based array indexing and the length of the substring is l, the last available position s at which the substring can start is n – l + 1
For example, if |x| = 4 and l = 2, the possible values for s in x_{s,2} are 1, 2, and 3 = 4 – 2 + 1
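The notation maps onto Python slicing once the 1-based start position is shifted by one. The helper names `substring` and `start_positions` below are hypothetical and only illustrate the indexing.

```python
def substring(x, s, l):
    """Return the substring of x of length l starting at 1-based position s."""
    n = len(x)
    if not (1 <= l <= n and 1 <= s <= n - l + 1):
        raise ValueError(f"no substring of length {l} starts at position {s}")
    # Python slices are 0-based and half-open, so shift the start by one.
    return x[s - 1 : s - 1 + l]


def start_positions(n, l):
    """All 1-based start positions for a substring of length l in a string of length n."""
    return list(range(1, n - l + 2))


x = "aabbabb"
print(substring(x, 1, 3))     # aab
print(substring(x, 2, 4))     # abba
print(start_positions(4, 2))  # [1, 2, 3]
```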
CYK Algorithm: Basic Insight
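[Figure: parse tree with root A and children B and C. B spans x_{s,k} (positions s through s+k-1), C spans x_{s+k,l-k} (positions s+k through s+l-1), and A spans x_{s,l}.]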
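For l > 1, A ⇒* x_{s,l} iff, for some k, 1 ≤ k < l:
1) A → BC is a production;
2) B ⇒* x_{s,k};
3) C ⇒* x_{s+k,l-k}
In other words, A derives x_{s,l} exactly when there is a rule A → BC and a split point k, 1 ≤ k < l, such that B derives the first k symbols of x_{s,l} and C derives the remaining l – k symbols.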
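To make the role of k concrete, the following sketch lists every way of cutting a substring into a non-empty left part (to be derived by B) and a non-empty right part (to be derived by C). The function name `splits` is hypothetical, and the string baaba is the input used later in these slides.

```python
def splits(x, s, l):
    """List (k, left, right) for every split point k with 1 <= k < l.

    left is the substring of length k starting at position s, and right is
    the substring of length l - k starting at position s + k (1-based).
    """
    whole = x[s - 1 : s - 1 + l]
    return [(k, whole[:k], whole[k:]) for k in range(1, l)]


for k, left, right in splits("baaba", 1, 3):
    print(k, left, right)
# 1 b aa
# 2 ba a
```

A substring of length l has l – 1 split points, so a substring of length 1 has none; that case is handled separately by productions of the form A → a.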
Table D[s, l]
CYK is a dynamic programming algorithm that, given a CNF grammar G = (V, T, S, P) and a string x over T such that |x| = n > 0, incrementally builds an n x n table D (D stands for ‘derives’)
D[s, l] is a set, possibly empty, of symbols A in V such that A ⇒* x_{s,l}
In other words, D[s, l] records all variables in G that derive x_{s,l}
D[s, l] Initialization
Let G = (V, T, S, P) be a CNF grammar and x be a string such that |x| = n > 0
Let x_{s,l} be the substring of x of length l that starts at position s
If l = 1, then, for each 1 ≤ s ≤ n, we can check if x_{s,1} can be derived directly from some variable A of G
How? By checking if G has a production A → x_{s,1}
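The initialization step can be sketched as a lookup over the terminal productions. The (head, body) encoding of productions and the name `init_row` are assumptions made for illustration; the grammar and the input baaba are the ones used in the example that follows in these slides.

```python
PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def init_row(x, productions):
    """Return the cells D[1,1], ..., D[n,1] as a list of sets.

    Index 0 of the result holds D[1,1], index 1 holds D[2,1], and so on.
    """
    return [
        {head for head, body in productions if body == (symbol,)}
        for symbol in x
    ]


row = init_row("baaba", PRODUCTIONS)
print([sorted(cell) for cell in row])
# [['B'], ['A', 'C'], ['A', 'C'], ['B'], ['A', 'C']]
```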
D[s, l] Initialization
Assume that our CNF grammar is as follows:
1. S → AB | BC
2. A → BA | a
3. B → CC | b
4. C → AB | a
Assume that the input is x = baaba
What does D[s, l] look like?
5 x 5 D[s, l]
      s=1        s=2        s=3        s=4        s=5
l=1
l=2
l=3
l=4
l=5
Computing D[1,1]
The input is x = baaba
The 1st symbol of the input is b
Thus, D[1,1] = {A | A → b}, where A is in V
There is only one production that qualifies: B → b
So D[1,1] = {B}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}
l=2
l=3
l=4
l=5
Computing D[2,1]
The input is x = baaba
The 2nd symbol of the input is a
We compute {A | A → a}, where A is in V
There are two such productions: A → a, C → a
So D[2,1] = {A, C}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}
l=2
l=3
l=4
l=5
Computing D[3,1]
The input is x = baaba
The 3rd symbol of the input is a
We compute {A | A → a}, where A is in V
There are two such productions: A → a, C → a
So D[3,1] = {A, C}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}
l=2
l=3
l=4
l=5
Computing D[4,1]
The input is x = baaba
The 4th symbol of the input is b
Thus, D[4,1] = {A | A → b}, where A is in V
There is only one production that qualifies: B → b
So D[4,1] = {B}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}
l=2
l=3
l=4
l=5
Computing D[5,1]
The input is x = baaba
The 5th symbol of the input is a
We compute {A | A → a}, where A is in V
There are two such productions: A → a and C → a
So D[5,1] = {A, C}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2
l=3
l=4
l=5
Computing D[1,2]
The only k with 1 ≤ k < 2 is k = 1, so we look for productions A → BC where B is in D[1,1] and C is in D[2,1]
Since D[1,1] = {B} and D[2,1] = {A, C}, the possibilities for the right-hand sides are {B} x {A, C} = {BA, BC}
The rules that match these possibilities are S → BC and A → BA
So D[1,2] = {S, A}
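The Cartesian product of two cells, followed by a lookup of matching right-hand sides, can be sketched with `itertools.product`. The (head, body) encoding and the helper name `heads_for` are assumptions for illustration.

```python
from itertools import product

PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def heads_for(left_cell, right_cell, productions):
    """Variables A with a production A -> BC, B in left_cell, C in right_cell."""
    # Every candidate right-hand side BC, as a 2-tuple of variables.
    candidates = set(product(left_cell, right_cell))
    return {head for head, body in productions if body in candidates}


d11 = {"B"}        # D[1,1]
d21 = {"A", "C"}   # D[2,1]
print(sorted(product(d11, d21)))                 # [('B', 'A'), ('B', 'C')]
print(sorted(heads_for(d11, d21, PRODUCTIONS)))  # ['A', 'S']
```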
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}
l=3
l=4
l=5
Computing D[2,2]
The only k with 1 ≤ k < 2 is k = 1, so we look for rules A → BC, where B is in D[2,1] and C is in D[3,1]
Since D[2,1] = D[3,1] = {A, C}, the right-hand side possibilities are AA, AC, CA, CC
There is only one rule that qualifies: B → CC
So D[2,2] = {B}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}
l=3
l=4
l=5
Computing D[3,2]
The only k with 1 ≤ k < 2 is k = 1, so we look for rules of the form A → BC, where B is in D[3,1] and C is in D[4,1]
D[3,1] = {A, C} and D[4,1] = {B}
So the right-hand side (RHS) possibilities are AB, CB
The rules whose RHSs match these possibilities are S → AB and C → AB
So D[3,2] = {S, C}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}        {S, C}
l=3
l=4
l=5
Computing D[4,2]
The only k with 1 ≤ k < 2 is k = 1, so we look for rules of the form A → BC, where B is in D[4,1] and C is in D[5,1]
D[4,1] = {B}; D[5,1] = {A, C}
So the RHS possibilities are BA and BC
The rules whose RHSs match these possibilities are S → BC and A → BA
So D[4,2] = {S, A}
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}        {S, C}     {S, A}
l=3
l=4
l=5
Computing D[1,3]
We look for k, such that 1 ≤ k < 3, and rules of the form A → BC, where, for k = 1, B is in D[1,1] and C is in D[2,2] or, for k = 2, B is in D[1,2] and C is in D[3,1]
For k = 1, D[1,1] = {B} and D[2,2] = {B}, so there is only one right-hand side possibility: BB
The grammar does not have any productions whose right-hand side is BB
For k = 2, D[1,2] = {S, A} and D[3,1] = {A, C}, so the RHS possibilities are SA, SC, AA, AC
The grammar does not have any productions whose RHS is SA, SC, AA, or AC
So D[1,3] = { }
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}        {S, C}     {S, A}
l=3   { }
l=4
l=5
Computing D[2,3]
We look for k, such that 1 ≤ k < 3, and rules of the form A → BC, where, if k = 1, B is in D[2,1] and C is in D[3,2] or, if k = 2, B is in D[2,2] and C is in D[4,1]
For k = 1, D[2,1] = {A, C} and D[3,2] = {S, C}
The RHS possibilities are AS, AC, CS, CC
The only rule that matches is B → CC
For k = 2, D[2,2] = {B} and D[4,1] = {B}
The only possibility is BB
No rules match
So D[2,3] = {B}
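Filling one cell of length l > 1 means repeating the product-and-lookup step for every split point k and taking the union. The sketch below assumes the table is a dictionary keyed by (s, l) and productions are (head, body) pairs; the name `fill_cell` is hypothetical. The cells of length 1 and 2 are copied from the computations on the preceding slides.

```python
from itertools import product

PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def fill_cell(D, s, l, productions):
    """Compute D[s, l] from already-filled cells of shorter length."""
    cell = set()
    for k in range(1, l):
        # Left part: length k starting at s. Right part: length l - k at s + k.
        candidates = set(product(D[(s, k)], D[(s + k, l - k)]))
        cell |= {head for head, body in productions if body in candidates}
    return cell


# Cells of length 1 and 2 for x = baaba, as computed on the slides.
D = {
    (1, 1): {"B"}, (2, 1): {"A", "C"}, (3, 1): {"A", "C"},
    (4, 1): {"B"}, (5, 1): {"A", "C"},
    (1, 2): {"S", "A"}, (2, 2): {"B"}, (3, 2): {"S", "C"}, (4, 2): {"S", "A"},
}
print(fill_cell(D, 1, 3, PRODUCTIONS))  # set()
print(fill_cell(D, 2, 3, PRODUCTIONS))  # {'B'}
```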
D[s, l] So Far
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}        {S, C}     {S, A}
l=3   { }        {B}
l=4
l=5
Rest of D[s, l]
      s=1        s=2        s=3        s=4        s=5
l=1   {B}        {A, C}     {A, C}     {B}        {A, C}
l=2   {S, A}     {B}        {S, C}     {S, A}
l=3   { }        {B}        {B}
l=4   { }        {S, A, C}
l=5   {S, A, C}
Is x=baaba Accepted?
Yes, because D[1,5] contains S. It means that S ⇒* x_{1,5}.
In other words, the substring of x that starts at 1 and has a length of 5, which is x itself, is derivable from S.
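Putting the initialization and the cell-filling step together gives a complete membership test. The following is a minimal sketch, assuming productions are (head, body) pairs and the table is a dictionary keyed by (s, l) with 1-based positions; the function name `cyk` and its return shape are illustrative choices, not the slides' own implementation.

```python
from itertools import product

PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def cyk(x, productions, start="S"):
    """Return (accepted, D), where D[(s, l)] is the set of variables deriving
    the substring of x of length l that starts at 1-based position s."""
    n = len(x)
    if n == 0:
        raise ValueError("this sketch needs a non-empty input string")
    D = {}
    # Length 1: productions of the form A -> a.
    for s in range(1, n + 1):
        D[(s, 1)] = {h for h, body in productions if body == (x[s - 1],)}
    # Lengths 2..n: productions of the form A -> BC over every split point k.
    for l in range(2, n + 1):
        for s in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                candidates = set(product(D[(s, k)], D[(s + k, l - k)]))
                cell |= {h for h, body in productions if body in candidates}
            D[(s, l)] = cell
    return start in D[(1, n)], D


accepted, D = cyk("baaba", PRODUCTIONS)
print(accepted)           # True
print(sorted(D[(1, 5)]))  # ['A', 'C', 'S']
print(sorted(D[(2, 4)]))  # ['A', 'C', 'S']
print(D[(1, 4)])          # set()
```

Returning the whole table alongside the verdict makes it easy to compare each cell with the table on the slides.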
How & Why CYK Works
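CYK runs in O(n^3) time for a fixed grammar, where |x| = n > 0
Both k and l – k are strictly less than l, so D[s, k] and D[s+k, l–k] are filled in before D[s, l]
If we know whether each of the two smaller derivations exists (i.e. B ⇒* x_{s,k} and C ⇒* x_{s+k,l-k}), we can determine if A ⇒* x_{s,l} via a production A → BC
When we reach l = n, we can determine if S ⇒* x_{1,n}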
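The cubic bound can be seen by counting the (s, l, k) triples the three nested loops visit. The sketch below only counts; the name `count_splits` is hypothetical. The count equals (n^3 – n)/6, and each triple costs an amount of work that depends on the grammar but not on n.

```python
def count_splits(n):
    """Count the (s, l, k) triples CYK visits for an input of length n."""
    total = 0
    for l in range(2, n + 1):          # substring length
        for s in range(1, n - l + 2):  # start position
            for k in range(1, l):      # split point
                total += 1
    return total


# The count matches the closed form (n^3 - n) / 6.
for n in (1, 5, 10, 100):
    assert count_splits(n) == (n**3 - n) // 6

print(count_splits(5))  # 20
```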
References & Reading Suggestions
Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation. Narosa Publishing House.
Moll, Arbib, and Kfoury. An Introduction to Formal Language Theory.
Jurafsky & Martin. Speech & Language Processing. Prentice Hall.
www.youtube.com/vkedco