Page 1:

A Random Text Model for the Generation of

Statistical Language Invariants

Chris Biemann
University of Leipzig, Germany

HLT-NAACL 2007, Rochester, NY, USA

Monday, April 23, 2007

Page 2:

Outline

• Previous random text models

• Large-scale measures for text

• A novel random text model

• Comparison to natural language text

Page 3:

Necessary property: Zipf's Law

• Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word and its rank r is given by $f(r) \sim r^{-z}$, where z is the exponent of the power law and corresponds to the slope of the curve in a log-log plot. For word frequencies in NL, z ≈ 1.

• Zipf-Mandelbrot: $f(r) \sim (r+c_1)^{-(1+c_2)}$: better approximates the lower frequencies at very high ranks.

[Figure: rank-frequency plot (log-log): spoken English vs. a power law with z=1.4 and a Zipf-Mandelbrot curve with c1=10, c2=0.4.]

Page 4:

Previous Random Text Models

B. B. Mandelbrot (1953)
• Sometimes called the "monkey at the typewriter" model.
• With probability w, a word separator is generated at each step;
• with probability (1-w)/N, a letter from an alphabet of size N is generated.
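In Python, this process can be sketched as follows (a minimal illustration; function and parameter names are mine):

    import random
    import string

    def monkey_text(n_steps, w=0.2, alphabet=string.ascii_uppercase, seed=0):
        """Mandelbrot's 'monkey at the typewriter': at each step, emit a
        word separator with probability w, otherwise one of N equiprobable
        letters from the alphabet."""
        rng = random.Random(seed)
        chars = [" " if rng.random() < w else rng.choice(alphabet)
                 for _ in range(n_steps)]
        return "".join(chars).split()

All words of the same length come out equiprobable, which is exactly the weakness criticised on the next slide.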

H. A. Simon (1955)
• No alphabet of single letters.
• At each time step, a previously unseen new word is added to the stream with probability α, whereas with probability (1-α) the next word is chosen among the words at previous positions (see the sketch below).
• Yields a frequency distribution that follows a power law with exponent z = (1-α).
• Modified by Zanette and Montemurro (2002):
  - sublinear vocabulary growth for higher exponents
  - Zipf-Mandelbrot law via a maximum probability threshold
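A minimal sketch of Simon's process; it uses the fact that drawing uniformly from all previous positions is the same as drawing words in proportion to their current frequency (names are mine):

    import random

    def simon_stream(n_tokens, alpha=0.1, seed=0):
        """Simon's process: with probability alpha append a brand-new word,
        otherwise repeat a token drawn uniformly from all previous
        positions, i.e. in proportion to its frequency (rich get richer)."""
        rng = random.Random(seed)
        stream = ["w0"]
        for i in range(1, n_tokens):
            if rng.random() < alpha:
                stream.append(f"w{i}")           # previously unseen word
            else:
                stream.append(rng.choice(stream))
        return stream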

Page 5:

Critique on Previous Models

• Mandelbrot: All words of the same length are equiprobable, since all letters are equiprobable. Ferrer i Cancho and Solé (2002): initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from?

• Simon: No concept of "letter" at all.

• Both:
  - no concept of a sentence
  - no word order restrictions: Simon = bag of words; Mandelbrot does not take the generated stream into account at all

Page 6:

Large-scale Measures for Text

• Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (the probability of the frequencies) should follow a power law with z ≈ 2 (Pareto distribution). A computational sketch follows this list.

• Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004).

• Sentence length: should also be distributed as in NL, following the same kind of gamma distribution.

• Significant neighbour-based co-occurrence graph: should be similar in terms of degree distribution and connectivity in random text and NL.
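A minimal sketch for the first measure, assuming a whitespace-tokenized corpus; it returns the rank-frequency pairs and the lexical spectrum:

    from collections import Counter

    def zipf_measures(tokens):
        """Rank-frequency pairs and lexical spectrum of a token stream;
        both should be near-straight lines on a log-log plot for
        Zipfian text."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        rank_freq = list(enumerate(freqs, start=1))   # (rank, frequency)
        counts = Counter(freqs)                       # frequency -> #types
        n_types = len(freqs)
        spectrum = {f: c / n_types for f, c in sorted(counts.items())}
        return rank_freq, spectrum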

Page 7:

A Novel Random Text Model

Two parts:
• Word Generator
• Sentence Generator

Both follow the principle of beaten tracks:
• Memorize what has been generated before.
• Generate with higher probability if generated more often before.

Inspired by Small World network generation, especially Kumar et al. (1999).

Page 8:

Word Generator

• Initialisation:
  - Letter graph of N letters.
  - Each vertex is connected to itself with weight 1.

• Choice:
  - When generating a word, the generator chooses a letter x according to its probability P(x), which is computed as the normalized weight sum of its outgoing edges (see the formula below).

• Parameter:
  - At every position, the word ends with probability w ∈ (0,1) or generates the next letter according to the letter production probability given above.

• Update:
  - For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one.

• Effect: self-reinforcement of letter probabilities:
  - the more often a letter is generated, the higher its weight sum will be in subsequent steps,
  - leading to an increased generation probability.

$P(x) = \frac{\mathrm{weightsum}(x)}{\sum_{v \in V} \mathrm{weightsum}(v)}$, with $\mathrm{weightsum}(y) = \sum_{u \in \mathrm{neigh}(y)} \mathrm{weight}(y,u)$
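A minimal Python sketch of this word generator; I assume a word always receives at least one letter before it can end, and class and method names are mine:

    import random
    from collections import defaultdict

    class WordGenerator:
        """Beaten-tracks word generator: letters are drawn in proportion
        to their accumulated outgoing edge weight; every generated letter
        bigram reinforces the corresponding directed edge."""
        def __init__(self, n_letters=26, w=0.4, seed=0):
            self.letters = [chr(ord("A") + i) for i in range(n_letters)]
            self.w = w
            self.rng = random.Random(seed)
            # self-loops of weight 1 initialise every letter's weight sum
            self.edges = {x: defaultdict(int, {x: 1}) for x in self.letters}

        def _draw_letter(self):
            sums = [sum(self.edges[x].values()) for x in self.letters]
            return self.rng.choices(self.letters, weights=sums, k=1)[0]

        def word(self):
            word = self._draw_letter()
            while self.rng.random() >= self.w:   # word ends with prob. w
                word += self._draw_letter()
            for a, b in zip(word, word[1:]):     # update: reinforce bigrams
                self.edges[a][b] += 1
            return word

Each call to word() both samples from and reinforces the letter graph, so frequently generated letters become still more likely in subsequent steps.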

Page 9:

Word Generator Example

[Figure: example letter graph with weighted edges.]

The small numbers next to the edges are edge weights. The probabilities of the letters in the next step are:

P(A) = 0.4, P(B) = 0.4, P(C) = 0.2

Page 10:

Measures on the Word Generator

• The word generator fulfills these measures much better than the Mandelbrot model.
• For the other measures, we need something extra...

[Figures: rank-frequency plot (word generator with w=0.2 vs. power law z=1 vs. Mandelbrot model) and lexical spectrum plot (word generator with w=0.2 vs. power law z=2 vs. Mandelbrot model), both log-log.]

Page 11:

Sentence Generator I

• Initialisation:
  - The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS.

• Word graph (directed):
  - Vertices correspond to words;
  - edge weights correspond to the number of times two words were generated in sequence.

• Generation:
  - A random walk on the directed edges starts at the BOS vertex.
  - With probability (1-s), where s is the new-word probability, an existing edge is followed from the current vertex to the next vertex;
  - the probability of choosing endpoint X among the endpoints of all outgoing edges of the current vertex C is given by

$P_{\mathrm{word}}(X) = \frac{\mathrm{weight}(C,X)}{\sum_{N \in \mathrm{neigh}(C)} \mathrm{weight}(C,N)}$

Page 12:

Sentence Generator II

• Parameter:
  - With probability s ∈ (0,1), a new word is generated by the word generator model;
  - the next word is then chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word N is given below.

• Update:
  - For each sequence of two words generated, the weight of the directed edge between them is increased by 1.

$P_{\mathrm{word}}(E) = \frac{\mathrm{indgw}(E)}{\sum_{v \in V} \mathrm{indgw}(v)}$, with $\mathrm{indgw}(X) = \sum_{v \in V} \mathrm{weight}(v,X)$
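A minimal sketch of the sentence generator, building on the WordGenerator class above; deferring the edge-weight updates to the end of each sentence is my simplification, not necessarily the paper's exact bookkeeping:

    import random
    from collections import defaultdict

    class SentenceGenerator:
        """Beaten-tracks sentence generator: a random walk over a growing
        word graph from BOS to EOS; with probability s a new word from
        the word generator is spliced in, and the walked edges are
        reinforced."""
        def __init__(self, wordgen, s=0.08, seed=1):
            self.wg, self.s = wordgen, s
            self.rng = random.Random(seed)
            self.graph = defaultdict(lambda: defaultdict(int))
            self.graph["<BOS>"]["<EOS>"] = 1      # initial BOS -> EOS edge

        def _pick(self, items):
            words, weights = zip(*items)
            return self.rng.choices(words, weights=weights, k=1)[0]

        def sentence(self):
            walk, cur = ["<BOS>"], "<BOS>"
            while cur != "<EOS>":
                if self.rng.random() < self.s:
                    walk.append(self.wg.word())   # splice in a new word
                    # its successor: an existing vertex, by weighted indegree
                    indgw = defaultdict(int)
                    for out in self.graph.values():
                        for dst, wt in out.items():
                            indgw[dst] += wt
                    cur = self._pick(indgw.items())
                else:                             # follow an existing edge
                    cur = self._pick(self.graph[cur].items())
                walk.append(cur)
            for a, b in zip(walk, walk[1:]):      # update: reinforce the walk
                self.graph[a][b] += 1
            return [t for t in walk if t not in ("<BOS>", "<EOS>")]

Since BOS to EOS is initially the only edge and keeps being reinforced, many walks return an empty sentence; as noted on the example slide below, these are simply omitted from the output.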

Page 13:

Sentence Generator Example

• In the last step, the second CA was generated as a new word by the word generator.

• Empty sentences are generated frequently; these are omitted in the output.

Page 14:

Comparison to Natural Language

• Corpus for comparison: the first 1 million words of the BNC, spoken English.
• 26 letters, uppercase, punctuation removed → the same alphabet in the word generator.
• 125,395 sentences → set s=0.08, remove the first 50K generated sentences.
• Average sentence length: 7.975 words.
• Average word length: 3.502 letters → set w=0.4.

Example sentences from the preprocessed corpus:

OOH

OOH

ERM

WOULD LIKE A CUP OF THIS ER

MM

SORRY NOW THAT S

NO NO I DID NT

I KNEW THESE PEWS WERE HARD

OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION

Page 15:

Word Frequency

[Figure: rank-frequency plot (log-log): sentence generator vs. English vs. power law z=1.5.]

• Zipf-Mandelbrot distribution
• Smooth curve
• Similar to English

Page 16:

Word Length

• More 1-letter words in the sentence generator.
• Longer words in the sentence generator.
• Curve is similar.
• Gamma distribution here: $f(x) \sim x^{1.5} \cdot 0.45^{x}$

[Figure: word length distribution; frequency vs. length in letters (log-log): sentence generator vs. English vs. gamma distribution.]

Page 17:

Sentence Length

• Longer sentences in English.
• More 2-word sentences in English.
• Curve is similar.

[Figure: sentence length distribution; number of sentences vs. length in words (log-log): sentence generator vs. English.]

Page 18:

Neighbor-based Co-occurrence Graph

• Minimum co-occurrence frequency = 2, minimum log-likelihood ratio = 3.84 (a computational sketch follows the table below).
• The NB-graph is a small world.
• Qualitatively, English and the sentence generator are similar.
• The word generator shows far fewer co-occurrences.
• English and the sentence generator differ by roughly a factor of 2 in clustering coefficient and number of vertices.

[Figure: degree distribution; number of vertices per degree interval (log-log): sentence generator vs. English vs. word generator vs. power law z=2.]

                    English sample   sentence gen.   word gen.   random graph (ER)
# of vertices       7154             15258           3498        10000
avg. shortest path  2.933            3.147           3.601       4.964
avg. degree         9.445            6.307           3.069       7
clustering coeff.   0.2724           0.1497          0.0719      6.89e-4
z                   1.966            2.036           2.007       -
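A minimal sketch of how such a significant neighbour-based co-occurrence graph can be extracted, using the standard log-likelihood ratio statistic over the 2x2 contingency table of adjacent word pairs (3.84 is the 5% chi-square threshold); the paper's exact counting scheme may differ:

    import math
    from collections import Counter

    def nb_cooccurrence_graph(tokens, min_freq=2, min_llr=3.84):
        """Keep directed neighbour pairs that occur at least min_freq
        times and pass the log-likelihood ratio significance test."""
        pairs = Counter(zip(tokens, tokens[1:]))
        left = Counter(a for a, _ in pairs.elements())    # a as left word
        right = Counter(b for _, b in pairs.elements())   # b as right word
        n = sum(pairs.values())                           # total bigrams

        def llr(k11, k1_, k_1):
            # observed 2x2 table from joint and marginal neighbour counts
            k12, k21 = k1_ - k11, k_1 - k11
            k22 = n - k11 - k12 - k21
            r1, r2, c1, c2 = k11 + k12, k21 + k22, k11 + k21, k12 + k22
            def term(obs, exp):
                return obs * math.log(obs / exp) if obs > 0 else 0.0
            return 2 * (term(k11, r1 * c1 / n) + term(k12, r1 * c2 / n)
                        + term(k21, r2 * c1 / n) + term(k22, r2 * c2 / n))

        return {(a, b): k for (a, b), k in pairs.items()
                if k >= min_freq and llr(k, left[a], right[b]) >= min_llr}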

Page 19:

Formation of Sentences

• The word graph grows and, at every time step, contains the full vocabulary used so far for generating.

• Random walks starting from BOS always end in EOS.

• Sentence length slowly increases: the random walk has more and more possibilities before finally arriving at the EOS vertex.

• Sentence length is influenced by both parameters of the model:
  - the word end probability w in the word generator,
  - the new word probability s in the sentence generator.

A short simulation sketch follows the figure below.

[Figure: sentence length growth; average sentence length per text interval (log-log) for (w=0.4, s=0.08), (w=0.4, s=0.1), (w=0.17, s=0.22), (w=0.3, s=0.09), compared against x^0.25.]
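To reproduce such a growth curve qualitatively, a short driver over the earlier sketches might look like this (the loop structure and interval size are my assumptions; w and s come from the plot):

    # uses the WordGenerator and SentenceGenerator sketches from earlier slides
    gen = SentenceGenerator(WordGenerator(n_letters=26, w=0.4), s=0.08)

    lengths, interval = [], 1000
    for i in range(10 * interval):
        sent = gen.sentence()
        if sent:                  # empty sentences are omitted, as in the model
            lengths.append(len(sent))
        if (i + 1) % interval == 0:
            avg = sum(lengths) / max(len(lengths), 1)
            print(f"after {i + 1} walks: avg. sentence length {avg:.2f}")
            lengths.clear()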

Page 20:

Conclusion

Novel random text model:
• obeys Zipf's law
• obeys the word length distribution
• obeys the sentence length distribution
• shows similar neighbour-based co-occurrence data

First model that:
• produces a smooth lexical spectrum without initial letter probabilities
• incorporates the notion of a sentence
• models word order restrictions

Page 21:

Sentence generator at work

Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF . XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U . RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF . R . Z U .

Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY . FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN . TA KV XJP O EGV J HQY KMQ U .

Page 22:

Questions?

Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm
