CYK Parser


By Carla and Cornelia Kempa

Overview

Non-directional methods:

• Top-down: Unger parser

• Bottom-up: CYK parser

Cocke-Younger-Kasami method

Recognition phase

Example grammar

• Number(s) → Integer | Real
• Integer → Digit | Integer Digit
• Real → Integer Fraction Scale
• Fraction → . Integer
• Scale → e Sign Integer | Empty
• Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Empty → ɛ
• Sign → + | -

Example Sentence: 32.5e+1

• 1. Concentrate on the substrings of the input sentence

Building the recognition table

32.5e+1 is in the language

• What problems can we already see in this example?

Another complication: ɛ-rules

Input: 43.1

The ɛ-problem

Shortest substrings of any input sentence: ɛ-substrings

We must compute R_ɛ, the set of non-terminals that derive ɛ

R_ɛ = { Empty, Scale }
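One way to compute R_ɛ is a small fixed-point loop over the rules. The sketch below is only an illustration: the dictionary layout of the grammar, the name `nullable`, and writing the start symbol as plain `Number` are assumptions made here, not part of the slides.

```python
# Hypothetical encoding of the example grammar: each non-terminal maps to a
# list of right-hand sides; a right-hand side is a tuple of symbols and the
# empty tuple stands for ɛ.
GRAMMAR = {
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in "0123456789"],
    "Empty":    [()],
    "Sign":     [("+",), ("-",)],
}

def nullable(grammar):
    """Return the set of non-terminals that derive ɛ."""
    r_eps = set()
    changed = True
    while changed:                      # repeat until nothing new is added
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in r_eps:
                continue
            # A right-hand side derives ɛ if all of its symbols do;
            # all() over the empty tuple is True, which covers Empty -> ɛ.
            if any(all(sym in r_eps for sym in rhs) for rhs in alternatives):
                r_eps.add(lhs)
                changed = True
    return r_eps

print(sorted(nullable(GRAMMAR)))  # ['Empty', 'Scale']
```

Terminals never enter the set, so any right-hand side containing a terminal is rejected automatically.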

Non-empty substrings of the input sentence

• Input: z = z_1 z_2 z_3 z_4 … z_n

• Compute the set of non-terminals that derive the substring of z starting at position i, of length l.

Terminology (also on the handout)

• i: index we are starting at

• l: length of this substring

• R_{s_{i,l}}: set of non-terminals deriving the substring s_{i,l}

• s_{i,0} = ɛ

• Set of non-terminals that derive ɛ:

R_{s_{i,0}} = R_ɛ

s_{i,l} = z_i z_{i+1} … z_{i+l-1}

The set of non-terminals deriving the substring s_{i,l}: R_{s_{i,l}}
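The indexing convention is easy to get wrong by one, so here is a minimal sketch of it. The helper name `substring` is made up for this illustration; positions are 1-based as in the terminology above.

```python
def substring(z, i, l):
    """Return s_{i,l}: the l symbols of z starting at 1-based position i."""
    return z[i - 1 : i - 1 + l]

z = "32.5e+1"
print(substring(z, 2, 4))        # 2.5e
print(repr(substring(z, 3, 0)))  # '' (the empty substring, ɛ)
```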

1.) Substrings of length 0

s_{i,0} = ɛ and R_{s_{i,0}} = R_ɛ

2.) Short substrings

3.) Longer substrings (say l = j)

All the information on substrings with l < j is available

Check each right-hand side (RHS) in the grammar to see if it derives s_{i,l}

• L → A1 … Am

s_{i,l} is divided into m (possibly empty) segments:

A1 → first segment of s_{i,l}

A2 → second segment of s_{i,l}

… → …

A1 … Am →* s_{i,l}

• So A1 →* first part of s_{i,l}

(let's say A1 has to derive a first part of s_{i,l} of length k)

A1 →* s_{i,k}

A1 is in the set R_{s_{i,k}}

• Assuming this, A2 … Am has to derive the rest:

A2 … Am →* s_{i+k,l-k}

This is attempted for every k
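The "try every k" step can be sketched as a short recursive check. Everything named here (`rhs_derives`, `symbol_derives`, the `table` dictionary keyed by `(i, l)`) is a hypothetical layout chosen for the illustration, and the table is filled in by hand with just the entries the example needs.

```python
def symbol_derives(sym, i, k, z, table, r_eps):
    """True if a single symbol derives s_{i,k}, as far as the table knows."""
    if k == 0:
        return sym in r_eps                  # empty segment: sym must derive ɛ
    if sym in table.get((i, k), set()):
        return True                          # non-terminal already in R_{s_{i,k}}
    return k == 1 and z[i - 1] == sym        # terminal matching one input symbol

def rhs_derives(rhs, i, l, z, table, r_eps):
    """True if the symbols A1 ... Am in rhs can derive s_{i,l}."""
    if not rhs:
        return l == 0                        # nothing left to derive
    first, rest = rhs[0], rhs[1:]
    for k in range(l + 1):                   # A1 covers s_{i,k}, for k = 0 .. l
        if symbol_derives(first, i, k, z, table, r_eps) and \
           rhs_derives(rest, i + k, l - k, z, table, r_eps):
            return True
    return False

z = "32.5e+1"
r_eps = {"Empty", "Scale"}
# Hand-filled excerpt of the recognition table (assumed already computed).
table = {
    (1, 2): {"Integer", "Number"},
    (3, 2): {"Fraction"},
    (4, 1): {"Digit", "Integer", "Number"},
}
# Real -> Integer Fraction Scale on s_{1,4} = "32.5" (Scale takes the empty segment)
print(rhs_derives(("Integer", "Fraction", "Scale"), 1, 4, z, table, r_eps))  # True
# Fraction -> . Integer on s_{3,2} = ".5"
print(rhs_derives((".", "Integer"), 3, 2, z, table, r_eps))                  # True
```

Note that for k = l the lookup asks for R_{s_{i,l}} itself, the very set under construction, so a single pass can miss members.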

Problems with this approach

1) Consider A2 … Am

m could be 1 and A1 a non-terminal

We are dealing with a unit rule

A1 must derive the whole substring s_{i,l} and thus be a member of R_{s_{i,l}}

But that's the set we are computing right now …

Solution to this problem

• A1 →* s_{i,l}

• Somewhere along the derivation there must be a first step not using a unit rule

A1 → B → … → C →* s_{i,l}

C is the first non-terminal using a non-unit rule in the derivation

Solution cont.

At some stage C is added to R_{s_{i,l}}

If we repeat the process, at some point B will be added, and in the next step A1 will be added

We have to repeat the process until no new non-terminals are added to R_{s_{i,l}}
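The repetition can be sketched as a fixed-point loop. The version below is a simplification that isolates explicit unit rules only; the slides describe repeating the whole check of every right-hand side. The names `UNIT_RULES` and `close_under_unit_rules` are invented for this example.

```python
# Unit rules A -> B of the example grammar, as A -> list of B (assumed layout).
UNIT_RULES = {
    "Number": ["Integer", "Real"],
    "Integer": ["Digit"],
    "Scale": ["Empty"],
}

def close_under_unit_rules(found, unit_rules):
    """Add every A with a unit rule A -> B, B already in the set, until stable."""
    result = set(found)
    changed = True
    while changed:                  # stop once a full pass adds nothing
        changed = False
        for lhs, targets in unit_rules.items():
            if lhs not in result and any(t in result for t in targets):
                result.add(lhs)
                changed = True
    return result

# A single digit: Digit is found first, Integer in the next pass, then Number.
print(sorted(close_under_unit_rules({"Digit"}, UNIT_RULES)))
# ['Digit', 'Integer', 'Number']
```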

Problem 2

ɛ-rules: consider the case where all but one of the A1 … At derive ɛ

B → A1 A2 A3 A4 A5 … At

B and A1 … At are non-terminals

A2 … At derive ɛ

So what stays is: B → A1, a unit rule

We have computed all the R_{s_{i,l}}

• If S is a member of R_{s_{1,n}}, the start symbol derives z (= s_{1,n}), the input string

CYK recognition with a grammar in ****-form:

• What are the restrictions we want to have on our grammar?

Useful restrictions

• No ɛ-rules
• No unit rules
• Limit the length of the right-hand side of each rule, say to two

• What we get out of this:
• A → a
• A → BC

• Where a is a terminal and A, B, C are non-terminals
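A quick check of whether a right-hand side has one of these two shapes might look as follows. The function name `is_cnf_rule` and the tuple encoding of right-hand sides are assumptions made for this sketch.

```python
def is_cnf_rule(rhs, nonterminals):
    """True if rhs has the form a (one terminal) or B C (two non-terminals)."""
    if len(rhs) == 1:
        return rhs[0] not in nonterminals                 # A -> a
    if len(rhs) == 2:
        return all(sym in nonterminals for sym in rhs)    # A -> B C
    return False                                          # ɛ-rule or too long

NONTERMINALS = {"Number", "Integer", "Real", "Fraction",
                "Scale", "Digit", "Empty", "Sign"}

print(is_cnf_rule(("7",), NONTERMINALS))                            # True
print(is_cnf_rule(("Integer", "Digit"), NONTERMINALS))              # True
print(is_cnf_rule(("Digit",), NONTERMINALS))                        # False (unit rule)
print(is_cnf_rule((), NONTERMINALS))                                # False (ɛ-rule)
print(is_cnf_rule(("Integer", "Fraction", "Scale"), NONTERMINALS))  # False (too long)
```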

Chomsky Normal Form …

(… not only to annoy students)

• Perfect grammar for CYK

How CYK works for a grammar in CNF

• R_ɛ is empty

• R_{s_{i,1}} can be read directly from the rules

A rule A → BC can never derive a single terminal

Procedure

• Iteratively (as before):
• 1) Fill the sets R_{s_{i,1}} directly
• 2) Process all substrings of length 2
• 3) Process all substrings of length 3
• 4) Continue in the same way for every length l up to n
• For the first step we use the rules of the form A → a
• For all the following steps we have to use the rules of the form A → BC

CYK and CNF

For a rule A → BC and a substring s_{i,l}, B must derive a first part s_{i,k} and C the rest s_{i+k,l-k}. The question the CYK parser has to answer is:

Does such a k exist?

Answering this question is easy:

• Just try all possibilities

• no problem since you are a computer ;-)

• Range: from 1 to (l-1)

• All the sets R_{s_{i,k}} and R_{s_{i+k,l-k}} have already been computed at this point
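Trying all k for one substring can be sketched in a few lines. The function name `combine`, the `(A, B, C)` triples for rules A → BC, and the table keyed by `(i, l)` are assumptions made for this illustration; the table excerpt is filled in by hand.

```python
def combine(table, i, l, binary_rules):
    """Compute R_{s_{i,l}} for l >= 2 from the sets of shorter substrings."""
    found = set()
    for k in range(1, l):                         # k = 1 .. l-1
        left = table.get((i, k), set())           # R_{s_{i,k}}
        right = table.get((i + k, l - k), set())  # R_{s_{i+k,l-k}}
        for a, b, c in binary_rules:              # rule A -> B C
            if b in left and c in right:
                found.add(a)
    return found

# s_{3,2} = ".5" of the input 32.5e+1, with the length-1 sets given by hand.
table = {(3, 1): {"T1"}, (4, 1): {"Number", "Integer", "Digit"}}
rules = [("Fraction", "T1", "Integer"), ("Integer", "Integer", "Digit")]
print(combine(table, 3, 2, rules))  # {'Fraction'}
```

Because both operands have length strictly less than l, the set being computed never appears on the right-hand side of its own computation.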

Transform our sample CF grammar into Chomsky Normal Form

• Overview

• 1) eliminate ɛ-rules

• 2) eliminate unit rules

• 3) remove non-productive non-terminals

• 4) remove non-reachable non-terminals

• 5) modify the rest until all grammar rules are of the form A → a, A → BC

Our number grammar in CNF

• Number(s) → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Number(s) → Integer Digit
• Number(s) → N1 Scale' | Integer Fraction
• N1 → Integer Fraction
• Integer → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Integer → Integer Digit
• Fraction → T1 Integer
• T1 → .
• Scale' → N2 Integer
• N2 → T2 Sign
• T2 → e
• Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Sign → + | -

Our example grammar in CNF

Input sentence: 32.5e+1

• 1) Bottom row: read directly from the grammar (rules of the form A → a)

• 2) Check each RHS in the grammar

Check each RHS of the grammar

• Two ways. Example: 2.5e (= s_{2,4})

• 1) Check each RHS, e.g. N1 Scale'

• 2) Compute possible right-hand sides from the recognition table

How this is done

1) N1 is not in R_{s_{2,1}} or R_{s_{2,2}}
N1 is a member of R_{s_{2,3}}
But Scale' is not a member of R_{s_{5,1}}

2) R_{s_{2,4}} is the set of non-terminals that have a RHS AB where either:

A in R_{s_{2,1}} and B in R_{s_{3,3}}
A in R_{s_{2,2}} and B in R_{s_{4,2}}
A in R_{s_{2,3}} and B in R_{s_{5,1}}

Possible combinations: N1 T2 or Number T2
In our grammar we do not have such a RHS, so nothing is added to R_{s_{2,4}}.
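The whole table for 32.5e+1 can be computed with a short recognizer over the CNF number grammar. This is a sketch under assumptions: the rule encodings (`TERMINAL_RULES`, `BINARY_RULES`), the function name `cyk_table`, and writing the start symbol as `Number` and the primed symbol as `Scale'` are choices made here, not taken from the slides.

```python
DIGITS = "0123456789"
# Rules A -> a, stored as terminal -> set of A (hypothetical layout).
TERMINAL_RULES = {d: {"Number", "Integer", "Digit"} for d in DIGITS}
TERMINAL_RULES.update({".": {"T1"}, "e": {"T2"}, "+": {"Sign"}, "-": {"Sign"}})
# Rules A -> B C, stored as (A, B, C).
BINARY_RULES = [
    ("Number", "Integer", "Digit"),
    ("Number", "N1", "Scale'"),
    ("Number", "Integer", "Fraction"),
    ("N1", "Integer", "Fraction"),
    ("Integer", "Integer", "Digit"),
    ("Fraction", "T1", "Integer"),
    ("Scale'", "N2", "Integer"),
    ("N2", "T2", "Sign"),
]

def cyk_table(z):
    """Return a dict mapping (i, l) to R_{s_{i,l}}, with 1-based i."""
    n = len(z)
    table = {}
    for i in range(1, n + 1):                    # bottom row: rules A -> a
        table[(i, 1)] = set(TERMINAL_RULES.get(z[i - 1], set()))
    for l in range(2, n + 1):                    # longer substrings, shortest first
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):                # every split position
                for a, b, c in BINARY_RULES:
                    if b in table[(i, k)] and c in table[(i + k, l - k)]:
                        cell.add(a)
            table[(i, l)] = cell
    return table

table = cyk_table("32.5e+1")
print(sorted(table[(2, 3)]))       # ['N1', 'Number']
print(table[(2, 4)])               # set()
print("Number" in table[(1, 7)])   # True
```

The empty set for (2, 4) matches the hand computation for 2.5e, and the start symbol in (1, 7) means the sentence is recognized.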

Recognition table

Recognition table (well-formed substring table)

Computing R_{s_{i,l}}: follow the arrows V and W simultaneously

A → BC,

B a member of a set on the V arrow,

C a member of a set on the W arrow

Comparison

• This process is much less complicated than the one we saw before

• Why?

Conclusion

» This process is much less complicated

• Reasons:

1) We do not have to repeat the process until no new non-terminals are added to R_{s_{i,l}}

(The substrings we are dealing with are proper substrings and cannot be equal to the string we start with)

Reasons cont.

2) We only have to find one place where the substring must be split into two

Result of the algorithm we have seen so far:

• Complete collection of sets R_{s_{i,l}}

• These sets can be organized in a triangular table:

Cost of the CYK algorithm

• Operations dependent on n, the number of input symbols:

• (n * (n+1)) / 2 substrings to be examined

• For each substring: at most n-1 different k-positions (worst case)

Cost of the CYK algorithm cont.

• All other operations are independent of n

The algorithm works in a time at most proportional to n³

That's far more efficient than exhaustive search (time exponential in the length of the input sentence)
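The counts behind the n³ bound can be checked numerically. The sketch below, with the made-up function name `cyk_work`, counts the substrings and the split positions actually tried for a CNF grammar; it counts table steps only and says nothing about measured running time.

```python
def cyk_work(n):
    """Return (number of substrings, number of (i, l, k) split attempts)."""
    substrings = n * (n + 1) // 2
    # A substring of length l has n - l + 1 start positions and l - 1 splits.
    splits = sum((n - l + 1) * (l - 1) for l in range(1, n + 1))
    return substrings, splits

for n in (7, 70, 700):
    substrings, splits = cyk_work(n)
    print(n, substrings, splits, n ** 3)
```

For the seven-symbol sentence 32.5e+1 this gives 28 substrings and 56 split attempts, well below 7³ = 343.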
