
1 Strings CopyWrite D.Bockus. 2 Strings Def: A string is a sequence (possibly empty) of symbols from...

Date post: 13-Jan-2016

Strings

CopyWrite D.Bockus


Strings

• Def: A string is a sequence (possibly empty) of symbols from some alphabet.

• What do we use strings for?
1) Text processing, word processing.
2) Grammatical structure of languages.
3) Searching, string sequences.


String Example 1

• E.g. Java's "for" statement (simplistic view).

for (initialization; condition; increment)
           u             v          w

– Where a "for" statement breaks down into 'for (u; v; w)'.
– We can then define each part:
  • u » identifier = constant
  • v » identifier relational_operator value
  • w » identifier++
• In this context we can also define a while loop as:
  – while (v)
• Deterministic context-free languages (programming languages) are defined by breaking rules down into sub-rules, and so on.


Strings Example 2

– Genetic Coding:

– aab cd aab d

Searching for, and matching, codes leads into graph theory.

[Figure: symbol sequences over the letters a, b, c, d, f and g, with parts labelled s1, s2, s3 and s4.]


String Example 3

• Compression
• Converting a large volume of symbols into a smaller format.
  – Huffman coding
  – LZW compression


Basics

Given a string v: The length of v can be expressed:

1) |v| = magnitude of v

2) length (v).

– Empty string: v = '' or v = ε, where ε denotes the empty (null) string and |ε| = 0.

• There are 5 common operations that may be performed on strings:
  – Insertion, Deletion, Substitution, Concatenation, Comparison.


Insertion & Deletion

• Insertion

k = ac where

a = (a1, a2 .. am)

c = (c1, c2 .. cn)

insert b = (b1, b2 ... bp) between a and c

k = abc

= a1, a2 … am, b1, b2 … bp, c1, c2 … cn

|k| = m + p + n

• Deletion

k = abc

delete c

k = ab
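The two operations above can be sketched in Java with `StringBuilder`. This is only an illustration: the class name `InsertDelete`, the method names, and the use of a zero-based position are assumptions, not part of the slides.

```java
// Hypothetical helper class illustrating insertion and deletion on strings.
class InsertDelete {
    /** Inserts b into k so that b starts at the given zero-based position. */
    static String insert(String k, int position, String b) {
        return new StringBuilder(k).insert(position, b).toString();
    }

    /** Deletes `length` characters from k, starting at zero-based `start`. */
    static String delete(String k, int start, int length) {
        return new StringBuilder(k).delete(start, start + length).toString();
    }

    public static void main(String[] args) {
        String k = insert("aaacc", 3, "bb");   // a = "aaa", c = "cc", b = "bb"
        System.out.println(k + " has length " + k.length());
        System.out.println(delete(k, 5, 2));   // remove the trailing "cc"
    }
}
```

With m = 3, p = 2 and n = 2, the inserted result has length m + p + n = 7, matching |k| = m + p + n.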


Substitution

k = αuβ, where α & β may be null, i.e. |α| = 0 or |β| = 0

Search for u & replace with v.

k = αvβ

Notice this same operation can be accomplished with a deletion and an insertion.

k = αuβ
Delete u: k = αβ
Insert v: k = αvβ

Note: |u| does not have to equal |v|; |k| before does not have to equal |k| after.
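A minimal Java sketch of substitution as "delete u, then insert v", assuming only the first occurrence of u is replaced. The class name `Substitute` and the variable names `alpha` and `beta` (the text before and after u) are illustrative choices, not from the slides.

```java
// Hypothetical helper: substitution expressed as a deletion followed by an insertion.
class Substitute {
    /** Replaces the first occurrence of u in k with v; returns k unchanged if u is absent. */
    static String substitute(String k, String u, String v) {
        int at = k.indexOf(u);
        if (at < 0) {
            return k;                                // nothing to replace
        }
        String alpha = k.substring(0, at);           // text before u (may be empty)
        String beta = k.substring(at + u.length());  // text after u (may be empty)
        String deleted = alpha + beta;               // delete u
        // insert v where u used to be
        return new StringBuilder(deleted).insert(alpha.length(), v).toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("L1 CMPR BANANAS", "BANANAS", "WATERMELONS"));
    }
}
```

Because |u| and |v| may differ, the result can be longer or shorter than the original k.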


Concatenation

This is the joining of 2 strings a & b.

c = a + b

So if a = (a1, a2 .. am) & b = (b1, b2 ... bn)

Then c = (a1, a2 .. am, b1, b2 ... bn)

– Note: concatenation may be performed with insertion, i.e. insert b at the end of a, or with substitution:
  • write a as aε, where ε is null;
  • substitute b for ε.

|c| = m + n


Comparison

– Compare a & b to see if one of the following is true.
1) a < b
2) a = b
3) a > b

• 1) a is less than b if, when a & b are compared element by element in lexicographic order, the first position k at which they differ has ak < bk.

a     b     comparison
a1    b1    a1 = b1
a2    b2    a2 = b2
a3    b3    a3 < b3
a4    b4    a4 = b4
      b5    (a5 = ε, i.e. null) < b5

a3 < b3. So, a < b.


Comparison Cont...

– Note: a3 < b3 is the first instance where an element in a differs from b, so a < b.

– If a3 = b3 then a is still less than b because |a| < |b|. We can think of the missing element a5 (null) as having a value of -∞ for comparison purposes.

• 2) For a = b the following must be true:
  • |a| = |b|
  – and
  • ak = bk for all k

• 3) a > b is the opposite of (1).
  – Or b < a
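The three cases can be sketched as one Java method. The class name `LexCompare` is hypothetical, and characters are compared by their `char` values, which is an assumption about the ordering of the alphabet. Java's own `String.compareTo` follows the same rule.

```java
// Hypothetical helper: element-by-element (lexicographic) string comparison.
class LexCompare {
    /** Returns -1 if a < b, 0 if a = b, and 1 if a > b. */
    static int compare(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        for (int k = 0; k < shared; k++) {
            char ak = a.charAt(k);
            char bk = b.charAt(k);
            if (ak != bk) {
                return ak < bk ? -1 : 1;   // first differing element decides
            }
        }
        // All shared elements are equal: the shorter string is the smaller one,
        // as if its missing elements compared below every real symbol.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(compare("abca", "abdab"));  // differs at the third element
        System.out.println(compare("abcd", "abcde"));  // a is a proper prefix of b
        System.out.println(compare("abc", "abc"));     // equal
    }
}
```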


String Representations

• Consider the string "L1 CMPR BANANAS WATERMELONS 12".

• There are 6 ways to represent strings in storage, noting that 3 criteria must be kept in mind:

– Storage Efficiency (1:1 packing ratio)

– Ease of Lookup (Searching)

– Ease of Modification
  • Insertion
  • Deletion


Fixed Length Strings

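L 1
C M P R
B A N A N A S
W A T E R M E L O N S
1 2

(One string per fixed-size slot; unused positions at the end of a slot are left empty.)

– Adv: Ease of modification.

– Dis: Poor storage efficiency, due to wasted space at the end of short strings.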


Var Strings

– Adv: Easier to look up strings, since we already have the length.

– Dis: Still wastes space.

Length  Characters
2       L 1
4       C M P R
7       B A N A N A S
11      W A T E R M E L O N S
2       1 2


Count Delimited

– Adv: Very efficient in space usage; lookup is not bad.

– Dis: Modification is hard. A replacement string must be the same length, or the array has to be readjusted.

02 L1 04 CMPR 07 BANANAS 11 WATERMELONS 02 12
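A minimal Java sketch of the count-delimited layout, assuming a fixed two-digit decimal count in front of each string (so no string may exceed 99 characters). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of count-delimited storage with a two-digit length prefix.
class CountDelimited {
    /** Packs the strings back to back, each preceded by its two-digit length. */
    static String pack(List<String> strings) {
        StringBuilder out = new StringBuilder();
        for (String s : strings) {
            if (s.length() > 99) {
                throw new IllegalArgumentException("string too long for a two-digit count");
            }
            out.append(String.format("%02d", s.length())).append(s);
        }
        return out.toString();
    }

    /** Walks the packed data: read a count, then read that many characters. */
    static List<String> unpack(String packed) {
        List<String> strings = new ArrayList<>();
        int pos = 0;
        while (pos < packed.length()) {
            int count = Integer.parseInt(packed.substring(pos, pos + 2));
            pos += 2;
            strings.add(packed.substring(pos, pos + count));
            pos += count;
        }
        return strings;
    }

    public static void main(String[] args) {
        String packed = pack(Arrays.asList("L1", "CMPR", "BANANAS", "WATERMELONS", "12"));
        System.out.println(packed);
        System.out.println(unpack(packed));
    }
}
```

Lookup has to hop from count to count, and replacing a string with one of a different length shifts everything stored after it, which is the modification cost noted above.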


Indexed List

– Adv: Good Storage and Search capabilities

– Dis: Modification is poor

Strategies include: always adding new strings and never reclaiming space except during a repack.

Index:   1   2   3   4   5  ...
Length:  2   4   7  11   2
Start:   1   3   7  14  25

Data: L1 CMPR BANANAS WATERMELONS 12 (stored packed, with no separators)
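The indexed list can be sketched in Java as one packed character buffer plus an index of (start, length) pairs. The class `IndexedStrings` and its methods are hypothetical, and the sketch uses zero-based positions internally, so a start of 13 here corresponds to position 14 in a one-based table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an indexed list: packed characters plus a (start, length) index.
class IndexedStrings {
    private final StringBuilder data = new StringBuilder();
    private final List<int[]> index = new ArrayList<>();   // each entry is {start, length}

    /** Appends s to the packed data and returns its zero-based index entry. */
    int add(String s) {
        index.add(new int[] {data.length(), s.length()});
        data.append(s);
        return index.size() - 1;
    }

    /** Looks up entry i directly, without scanning the strings before it. */
    String get(int i) {
        int[] entry = index.get(i);
        return data.substring(entry[0], entry[0] + entry[1]);
    }

    /** Zero-based start position of entry i within the packed data. */
    int start(int i) {
        return index.get(i)[0];
    }

    public static void main(String[] args) {
        IndexedStrings table = new IndexedStrings();
        for (String s : new String[] {"L1", "CMPR", "BANANAS", "WATERMELONS", "12"}) {
            table.add(s);
        }
        System.out.println(table.get(3) + " starts at " + (table.start(3) + 1));
    }
}
```

Only appending is supported, in line with the strategy of always adding new strings and reclaiming space only during a repack.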


Linked List

– Adv: Modification is simple pointer manipulation.

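– Dis: Storage overhead.
  • one address per character

– Note: Lookup can be improved by adding an additional length field to the table or by employing a hash function.

[Figure: Linked List. A table of (length, index) entries (2, 1), (4, 2), (7, 3), (11, 4), (2, 5); each entry points to a chain of one-character nodes, e.g. L → 1 and C → M → P → R.]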


Blocked Linked List

– Adv: Better storage than a linked list.
  • More characters per node

– Note: There is a trade-off between dealing with single characters and blocks of characters during modification.

• Note: If modification is not required, then methods such as indexed lists are quite useful. Applications include symbol tables in compilers.


Implementation

– In most cases a variable length string structure is desirable, i.e. the most versatile.

– Consider a string type as:

    String {
        int size;
        char data[];
    }

– Java declares string objects with methods to determine length and other attributes.

– Declaring Variables:

    String S1, S2;


Basic Functions

S1.length(); -- Returns the length of S1

• Other useful functions
  – String s.concat(String t);
  – String new String(s);
  – String s.substring(int i);
  – int s.indexOf(String t, int index);

See the Java API.
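A short Java sketch that exercises the functions listed above on sample values. The class `StringBasics`, the method `demo`, and the sample strings are illustrative assumptions; the `String` methods themselves are the standard `java.lang.String` ones.

```java
// Hypothetical demo of the java.lang.String methods named on the slide.
class StringBasics {
    /** Returns the results of the listed calls, in order, as text. */
    static String[] demo(String s, String t) {
        return new String[] {
            String.valueOf(s.length()),       // number of characters in s
            s.concat(t),                      // s followed by t
            new String(s),                    // a copy with the same characters
            s.substring(2),                   // from zero-based index 2 to the end
            String.valueOf(s.indexOf(t, 3))   // first occurrence of t at or after index 3, or -1
        };
    }

    public static void main(String[] args) {
        for (String result : demo("BANANAS", "NA")) {
            System.out.println(result);
        }
    }
}
```

For s = "BANANAS" and t = "NA" the five results are 7, BANANASNA, BANANAS, NANAS and 4.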


Variable Length Coding

Old Tree

Char  Prob.  Bits (MRC)  Bits (Fixed)  (pi)lg(pi)  pi(bits)
a     0.15   3           3             -0.41       0.45
e     0.25   2           3             -0.50       0.5
i     0.13   3           3             -0.38       0.39
m     0.09   4           3             -0.31       0.36
s     0.15   2           3             -0.41       0.3
z     0.02   4           3             -0.11       0.08
t     0.09   4           3             -0.31       0.36
r     0.12   4           3             -0.37       0.48

H(U) = 2.81

       Ave.  Redundancy
Fixed  3     0.06
MRC    2.92  0.04
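The table's summary values can be recomputed with a few lines of Java. The slides do not state the redundancy formula; the figures shown are consistent with (average length - H(U)) / average length, so that formula is an assumption here, as are the class and method names.

```java
// Hypothetical sketch: entropy, average code length and redundancy for a code table.
class CodeStats {
    /** H(U) = -sum(pi * lg(pi)), in bits per symbol. */
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0) {
                h -= pi * (Math.log(pi) / Math.log(2));
            }
        }
        return h;
    }

    /** Average code length = sum(pi * bits_i). */
    static double averageLength(double[] p, int[] bits) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += p[i] * bits[i];
        }
        return sum;
    }

    /** Assumed definition: the fraction of the average length that exceeds the entropy. */
    static double redundancy(double average, double h) {
        return (average - h) / average;
    }

    public static void main(String[] args) {
        double[] p  = {0.15, 0.25, 0.13, 0.09, 0.15, 0.02, 0.09, 0.12}; // a e i m s z t r
        int[] bits  = {3, 2, 3, 4, 2, 4, 4, 4};                          // old-tree code lengths
        double h = entropy(p);
        double avg = averageLength(p, bits);
        System.out.printf("H(U)=%.2f ave=%.2f redundancy=%.2f%n", h, avg, redundancy(avg, h));
    }
}
```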


Huffman Coding Algorithm

1) Collect a history of the frequencies of the characters, i.e. determine the probabilities.

2) Arrange the characters in an ordered list (priority queue) based on increasing probabilities (frequency).

3) While (more than 1 node in List) Do
   i) Remove the first 2 nodes.
   ii) Combine them into a tree and have the tree root represent the sum of the frequencies of the children.
   iii) Insert the tree into the List, maintaining proper List order.
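The steps above can be sketched in Java with `java.util.PriorityQueue`. This is a minimal illustration that only derives code lengths (the depth of each leaf), not the bit patterns. The names `HuffmanSketch`, `Node`, `codeLengths` and `weightedBits` are hypothetical, and frequencies are given as integer counts (here, percentages) to avoid floating-point ties.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of the Huffman construction; it yields code lengths only.
class HuffmanSketch {
    /** Leaves carry a symbol; internal nodes carry two children and a summed weight. */
    static final class Node {
        final long weight;
        final Character symbol;
        final Node left, right;
        Node(long weight, Character symbol, Node left, Node right) {
            this.weight = weight; this.symbol = symbol; this.left = left; this.right = right;
        }
    }

    static Map<Character, Integer> codeLengths(Map<Character, Integer> freq) {
        // Step 2: ordered list (priority queue) by increasing frequency.
        PriorityQueue<Node> list = new PriorityQueue<>((x, y) -> Long.compare(x.weight, y.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            list.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        // Step 3: repeatedly combine the two lowest-frequency nodes.
        while (list.size() > 1) {
            Node first = list.poll();
            Node second = list.poll();
            list.add(new Node(first.weight + second.weight, null, first, second));
        }
        Map<Character, Integer> lengths = new HashMap<>();
        collect(list.poll(), 0, lengths);
        return lengths;
    }

    private static void collect(Node node, int depth, Map<Character, Integer> out) {
        if (node == null) return;
        if (node.symbol != null) {
            out.put(node.symbol, Math.max(depth, 1));  // a lone symbol still needs 1 bit
            return;
        }
        collect(node.left, depth + 1, out);
        collect(node.right, depth + 1, out);
    }

    /** Sum of frequency * code length over all symbols. */
    static long weightedBits(Map<Character, Integer> freq, Map<Character, Integer> lengths) {
        long total = 0;
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            total += (long) e.getValue() * lengths.get(e.getKey());
        }
        return total;
    }
}
```

Ties between equal frequencies may be broken either way, so individual code lengths can differ between runs of different implementations, but the weighted total is the same for every Huffman tree built from the same frequencies.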


Variable Length Coding (New Tree)

Based on the new tree:

Char  Prob.  Bits (MRC)  Bits (Fixed)  (pi)lg(pi)  pi(bits)
a     0.15   3           3             -0.41       0.45
e     0.25   2           3             -0.50       0.5
i     0.13   3           3             -0.38       0.39
m     0.09   3           3             -0.31       0.27
s     0.15   3           3             -0.41       0.45
z     0.02   4           3             -0.11       0.08
t     0.09   4           3             -0.31       0.36
r     0.12   3           3             -0.37       0.36

H(U) = 2.81

       Ave.  Redundancy
Fixed  3     0.06
MRC    2.86  0.02


LZW Compression

• Lempel-Ziv-Welch (LZW)
• Uses a method of finding the largest known prefix in a character string.
• Lossless
  – The compressed file can be reconstructed without data loss.
• Typical uses:
  – GIF, TIFF
  – zip & unzip (the early "shrink" method; most zip files now use DEFLATE)


LZW Compression

• Idea is to build a code table, where codes are added as they are discovered.

• Look at the prefix for a given character.


Compressor Pseudo Code
http://marknelson.us/1989/10/01/lzw-data-compression/

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
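The compressor pseudo code translates fairly directly into Java. The sketch below assumes the alphabet is passed in as a string whose characters receive codes 0, 1, 2, ... in order, that every input character belongs to that alphabet, and that the code table may grow without limit. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an LZW compressor over a caller-supplied alphabet.
class LzwCompressor {
    static List<Integer> compress(String input, String alphabet) {
        // Initialise the string table with one code per alphabet symbol.
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < alphabet.length(); i++) {
            table.put(String.valueOf(alphabet.charAt(i)), i);
        }
        List<Integer> output = new ArrayList<>();
        if (input.isEmpty()) {
            return output;
        }
        String string = String.valueOf(input.charAt(0));
        for (int i = 1; i < input.length(); i++) {
            char character = input.charAt(i);
            String extended = string + character;
            if (table.containsKey(extended)) {
                string = extended;                    // keep growing the known prefix
            } else {
                output.add(table.get(string));        // emit the largest known prefix
                table.put(extended, table.size());    // next free code
                string = String.valueOf(character);
            }
        }
        output.add(table.get(string));                // flush the final prefix
        return output;
    }

    public static void main(String[] args) {
        System.out.println(compress("aaabbbbbbaabaaba", "ab"));
    }
}
```

For the alphabet "ab" (a = 0, b = 1) and the input aaabbbbbbaabaaba, this produces the codes 0 2 1 4 5 3 7.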


De-compressor Pseudo Code
http://marknelson.us/1989/10/01/lzw-data-compression/

Read OLD_CODE
output OLD_CODE
CHARACTER = OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
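A matching Java sketch of the de-compressor, under the same assumptions: the alphabet string defines codes 0, 1, 2, ..., the code sequence is well formed, and the table is unbounded. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an LZW de-compressor that rebuilds the code table as it goes.
class LzwDecompressor {
    static String decompress(List<Integer> codes, String alphabet) {
        // The translation table starts with one entry per alphabet symbol.
        List<String> table = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            table.add(String.valueOf(c));
        }
        if (codes.isEmpty()) {
            return "";
        }
        StringBuilder output = new StringBuilder();
        int oldCode = codes.get(0);
        output.append(table.get(oldCode));            // no entry is made for the first code
        for (int i = 1; i < codes.size(); i++) {
            int newCode = codes.get(i);
            String previous = table.get(oldCode);
            String string;
            if (newCode < table.size()) {
                string = table.get(newCode);          // code found
            } else {
                string = previous + previous.charAt(0);  // not found: previous + its first char
            }
            output.append(string);
            table.add(previous + string.charAt(0));   // previous + first char of current
            oldCode = newCode;
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompress(Arrays.asList(0, 2, 1, 4, 5, 3, 7), "ab"));
    }
}
```

With the alphabet "ab", the codes 0 2 1 4 5 3 7 decode to aaabbbbbbaabaaba. In the "not found" branch the new table entry equals the string just output, which is why it can be reconstructed without being transmitted.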


Compressor Example

• Assume we have an alphabet of a and b.

• We start by building a code book initialized to all characters in the alphabet, in this case a and b.

• We can now compress the string:
a a a b b b b b b a a b a a b a

Code  String
0     a
1     b
2-9   (not yet assigned)


Compressor Example …

a a a b b b b b b a a b a a b a

At each step: find the largest prefix of the remaining input that is in the code book, output its code, then add that prefix + next char to the code book.

Step  Largest prefix  Output code  Added to code book
1     a               0            2 = a a
2     a a             2            3 = a a b
3     b               1            4 = b b
4     b b             4            5 = b b b
5     b b b           5            6 = b b b a
6     a a b           3            7 = a a b a
7     a a b a         7            (none)

No more input to compress, so stop.

Output: 0 2 1 4 5 3 7


De-compressor Example

• We have an encoded string: 0 2 1 4 5 3 7

• To decode we need two things:
  – Knowledge of the alphabet.
  – An initialized code book based on the alphabet.

• Headers on, say, GIF files contain the alphabet information.

• The code book is rebuilt during de-compression.

Code  String
0     a
1     b
2-9   (not yet assigned)


De-compressor Example..

• During De-compression a code is read and an attempt is made to find this code in the code book.

• There are two cases:
  – The code is found in the code book.
  – The code is not found in the code book.

• Code found:
  – Output the string from the found code.
  – Make an entry based on: previous string + firstChar of current string.

• Not found:
  – Make an entry into the code book based on: previous string + firstChar of previous string.
  – Output the string of the new entry.


De-compressor Example...

• Notice that a code which is not found is a special case:

• E.g. during compression of a a a b b b ….
  – a is coded to 0, but the compressor now enters aa into the code book.
  – aa is the next code to be used.
  – During de-compression, we can guess at this code.
  – Text(previous) + FC(previous).


De-compressor Example….

• More formally:
  – We encounter a string P[…]P[…]PQ.
  – If P[…] is in the code book and P[…]P is not, then the compressor outputs P[…] and adds P[…]P to the code book.
  – When the de-compressor sees the code for P[…]P, it will not have added this code yet.
  – We know from the pattern that P[…] is already in the code book and it was the last code encountered, and that P[…]P would normally be added next (during compression).
  – So we can accurately guess and enter P[…]P into the code book.

• Taken from: http://www.danbbs.dk/~dino/whirlgif/lzw.html


De-compressor Example….

Encoded input: 0 2 1 4 5 3 7

Initial code book: 0 = a, 1 = b

Code  Case       Output text  Code book entry made
0     Found      a            none (no code book entry is made for the first code)
2     Not found  a a          2 = a a       Text(previous) + FC(previous)
1     Found      b            3 = a a b     Text(previous) + FC(current)
4     Not found  b b          4 = b b       Text(previous) + FC(previous)
5     Not found  b b b        5 = b b b     Text(previous) + FC(previous)
3     Found      a a b        6 = b b b a   Text(previous) + FC(current)
7     Not found  a a b a      7 = a a b a   Text(previous) + FC(previous)

In each "Not found" case the output is the last code entered into the code book.

No more code to de-compress - STOP

Output text: a a a b b b b b b a a b a a b a


Links

http://www.cs.sfu.ca/cs/CC/365/li/squeeze/

Squeeze Page - Applets dealing with compression Algorithms

http://www.geocities.com/yccheok/lzw/lzw.html


Finite State Machine for KMP Pattern 1010110

[Figure: finite state machine with states 0 through 7; transitions are labelled 0 and 1.]
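The transition table of such a machine can be built mechanically. The Java sketch below uses the standard DFA-style construction for KMP over a binary alphabet; it is not necessarily how the original diagram was produced. The class name `KmpDfa`, its methods, and the assumption that text and pattern contain only '0' and '1' are illustrative.

```java
// Hypothetical sketch: KMP as a finite state machine over the alphabet {0, 1}.
class KmpDfa {
    /** dfa[c][j] is the next state when symbol c (0 or 1) is read in state j. */
    static int[][] build(String pattern) {
        int m = pattern.length();
        int[][] dfa = new int[2][m];
        if (m == 0) {
            return dfa;
        }
        dfa[pattern.charAt(0) - '0'][0] = 1;
        int restart = 0;  // state the machine would be in after re-reading pattern[1..j-1]
        for (int j = 1; j < m; j++) {
            for (int c = 0; c < 2; c++) {
                dfa[c][j] = dfa[c][restart];              // mismatch: behave like the restart state
            }
            dfa[pattern.charAt(j) - '0'][j] = j + 1;      // match: advance to the next state
            restart = dfa[pattern.charAt(j) - '0'][restart];
        }
        return dfa;
    }

    /** Returns the zero-based index of the first match, or -1 if there is none. */
    static int search(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        int[][] dfa = build(pattern);
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = dfa[text.charAt(i) - '0'][state];
            if (state == m) {
                return i - m + 1;                         // reached the accepting state
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("110101011011", "1010110"));
    }
}
```

For the pattern 1010110 the machine has states 0 to 7, with 7 accepting. As an example of a fallback transition, reading 0 in state 5 (after 10101) leads to state 4, because 1010 is the longest suffix of 101010 that is also a prefix of the pattern. Searching the text 110101011011 reports a match starting at zero-based index 3.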

