Overview - University of Western Australia · 10/7/2009

10/7/09


Lecture 20: String Processing

Overview

•  Reading lines from text files

•  Processing strings

   – Breaking a string into tokens
   – Comparing strings for equality
   – Converting strings to numerals

Reading lines from text files

•  Matlab's implementation of the fscanf function is not really very useful

   –  all the values read have to be stored as data of the same type within the returned array.

•  This limitation somewhat defeats the idea of being allowed to specify a format string.

•  The alternative approach to reading a text file is

   –  to read each line of the file as a long string,
   –  to break the string into tokens (the words, numbers, or punctuation in the string),
   –  to process the tokens individually.

Reading lines of text

•  Matlab provides two functions for reading lines of text from a file:

   % Read a line from a file *excluding* the end of line characters.
   >> line = fgetl(fid)

   % Read a line from a file *including* the end of line characters.
   >> line = fgets(fid)

•  If the end of file is encountered, -1 is returned.
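The difference between the two functions can be checked with a short script. This is only a sketch: the temporary file and its two lines of content are made up for the illustration and are not part of the lecture material.

```matlab
% Write a small two-line file to a temporary location (hypothetical content).
fname = [tempname, '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

% Read it back with both functions.
fid = fopen(fname, 'rt');
withoutEol = fgetl(fid);   % First line, end of line characters removed
withEol    = fgets(fid);   % Second line, end of line characters kept
atEof      = fgetl(fid);   % Nothing left to read, so -1 is returned
fclose(fid);
delete(fname);

fprintf('fgetl returned %d characters\n', length(withoutEol));
fprintf('fgets returned %d characters\n', length(withEol));
fprintf('At end of file the result is a string: %d\n', ischar(atEof));
```

Because the end-of-file value is a number rather than a string, `ischar(line)` is another way to test whether a line was actually read.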


An example

•  For example:

% DISPLAYFILE: A function to display the contents

% of a file.

%

% Usage: displayFile(filename)

%

% Arguments: filename - The name of the file to

% display.

%

% Returns: Nil.

%

% Author: Lecturers

% Date: Continuously Updated

Example (cont.)

function displayFile(filename)

[fid, msg] = fopen(filename, 'rt');
error(msg);

line = fgets(fid);            % Get the first line from the file.
while line ~= -1
    fprintf('%s', line);      % Print the line on the screen.
    line = fgets(fid);        % Get the next line from the file.
end

fclose(fid);

Processing Strings

•  Text: Chapter 6.2.

•  Matlab's string processing functions are also closely modelled on their C equivalents.

Breaking a string into tokens

•  Tokens are bits of a string - for example, the numbers, words, and punctuation.

•  Which “bits” are important depends on the context or application.

•  E.g. if you are parsing an English sentence you may just be interested in the words and not the punctuation (such as full stops):

   Given “Life wasn’t meant to be easy. But should it be this hard?”

   The tokens of interest are: Life wasn’t meant to be easy But should it be this hard

•  On the other hand, if you are parsing numbers, you may want to keep the full stops:

   Given “3.245, 7.683, -24.7”

   The desired tokens are: 3.245 7.683 -24.7, not: 3 245 7 683 24 7


Delimiters

•  The things that are used to break up the text into tokens are called delimiters.

   E.g. given “3.245, 7.683, -24.7”, if the delimiters are “, ” (comma and space) we get

   3.245 7.683 -24.7

   whereas if they are “, .-” (or all punctuation characters) we get

   3 245 7 683 24 7

•  Tokens are typically delimited by spaces, tabs, or commas.

   E.g. given a string 'A string of 5 tokens + three more', the tokens are

   'A' 'string' 'of' '5' 'tokens' '+' 'three' 'more'

•  Example: reading and writing text files with Excel...

The strtok function

•  The strtok function is used to extract tokens from a string.

•  The general syntax of the strtok function is:

   [token, remainder] = strtok(string, delim)

   –  The variable string is the string to be "tokenised".
   –  The function returns the first token delimited by one of the characters in the delim string (the delimiters).

•  If delim is omitted, the tokens are delimited by any white space character by default.

•  After extracting the first token of string, the remainder of the string is returned as the separate string array remainder.

•  The extracted token is stored in the output variable token.
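As a rough illustration of how the choice of delimiters changes the tokens, the sketch below applies strtok repeatedly to the number string from the earlier slides, once for each delimiter set. The variable names (delimSets, results, and so on) are chosen for this illustration only.

```matlab
str = '3.245, 7.683, -24.7';
delimSets = {', ', ', .-'};          % The two delimiter sets discussed earlier
results = cell(1, numel(delimSets)); % One cell array of tokens per delimiter set

for k = 1:numel(delimSets)
    remainder = str;
    tokens = {};
    while ~isempty(remainder)
        [tok, remainder] = strtok(remainder, delimSets{k});
        if ~isempty(tok)             % Skip the empty token left by trailing delimiters
            tokens{end+1} = tok;
        end
    end
    results{k} = tokens;
    fprintf('Delimiters "%s" give: %s\n', delimSets{k}, strjoin(tokens, ' | '));
end
```

With comma and space as delimiters the three numbers survive intact; adding the full stop and minus sign to the delimiter set splits them into six separate digit groups.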

An example

•  For example, imagine breaking a line of text into a cell array of tokens:

   line = 'A string of 5 tokens + three more';   % Initialise or read a line from a file.

   token = {};
   ii = 1;

   while any(line)
       [token{ii}, line] = strtok(line);   % Repeatedly apply the strtok function.
       ii = ii + 1;
   end

•  The any function returns True if any element of a vector is non-zero.

•  At the end of the while loop, the token cell array contains all the tokens in the string.

   >> token

   token =

       'A' 'string' 'of' '5' 'tokens' '+' 'three' 'more'

Comparing strings for equality

• Two strings should never be compared for equality using the == operator.

• The == operator will return an error if its two arguments are not of the same length (unless one of them is a single character).

•  Even if the strings are the same length, the == operator will return an array of boolean values depending on how individual characters match, not a single boolean answer.


String Comparison Functions

•  Strings in Matlab should be compared using the following functions:

   >> c = strcmp(str1, str2)        % Returns True if str1 is
                                    % identical to str2.

   >> c = strcmpi(str1, str2)       % Compares strings ignoring
                                    % case differences.

   >> c = strncmp(str1, str2, N)    % Compares the first N
                                    % characters of the strings.

   >> c = strncmpi(str1, str2, N)   % Compares the first N
                                    % characters of the strings
                                    % ignoring case differences.

•  Note that the standard C/Java language functions differ in that they return 0 if the strings are identical, a negative value if the first string is alphabetically less than the second string, and a positive value otherwise.
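A short sketch contrasting the == operator with the four comparison functions. The example strings are arbitrary and chosen only to show each behaviour.

```matlab
a = 'beam';
b = 'Beam';
c = 'beams';

elementwise = (a == b);          % Same length: one result per character, [0 1 1 1]
same        = strcmp(a, b);      % false, the case of the first letter differs
sameNoCase  = strcmpi(a, b);     % true, case is ignored
prefix      = strncmp(a, c, 4);  % true, the first 4 characters match
prefixNoCase = strncmpi(b, c, 4); % true, first 4 characters match ignoring case

% Different lengths: == raises an error, whereas strcmp simply returns false.
try
    a == c;
    lengthMismatchError = false;
catch
    lengthMismatchError = true;
end
differentLength = strcmp(a, c);  % false
```

Each of the comparison functions returns a single logical value, so the result can be used directly as the condition of an if statement.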

Converting strings to numerals

•  The function str2num will evaluate a string as a Matlab expression, converting it to a number.

•  For example:

   >> x = str2num('21')

   x =

       21

   >> x = str2num('1+2')

   x =

       3

Str2num and eval

•  Note however, that spaces can be significant.

   >> x = str2num('1 +2')   % This is interpreted as an array
                            % containing two numbers: 1 and +2.

   x =

       1 2

•  The eval function is similar, but perhaps more predictable.

   >> x = eval('1 +2')

   x =

       3

   >> x = eval('[1 2]')

   x =

       1 2
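Both str2num and eval evaluate the string as Matlab code. When a token is expected to be a single plain number, the str2double function is a possible alternative. It is not covered in this lecture, and the sketch below, with made-up tokens, only assumes that returning NaN for non-numeric text is acceptable for the task.

```matlab
tokens = {'100', '.5', '-24.7', 'steel'};   % Hypothetical tokens from a data file
values = zeros(1, numel(tokens));

for k = 1:numel(tokens)
    values(k) = str2double(tokens{k});      % NaN when the token is not a number
end

isNumber = ~isnan(values);                  % [1 1 1 0]
```

Checking the result with isnan gives a simple way to detect a token that should have been a number but was not.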

Summary

•  The operations described in this lecture form the basic building blocks that allow you to process text files:

– Reading values from files.

– Reading lines from files.

– Breaking lines into tokens.

– Comparing strings.

– Converting strings to values.


Putting it all together in one example

•  Here is a text datafile that specifies the properties of 3 beams.

CITS1005BEAM % Magic ID value
%
% This is a datafile to define a series of beams
% File must start with the appropriate Magic value
% Comments are indicated by `%'
% Number of beams is specified by the keyword `Nbeams'
% followed by a value.
%
% Each beam definition is started by the keyword `beam'
% Fields of `length', `section_area' and `material'
% must be specified,
% but can appear in any order

Example (cont.)

Nbeams 3 % Number of beams

beam % Start of beam definition
length 100
section_area .5
material steel

beam % Start of beam definition
length 8
section_area .04
material wood

beam % Start of beam definition
% Oddly formatted, but valid, beam specification
material rubber length 1 section_area .3

Design your solution

•  The structure of this file has a couple of features:

   –  The data file starts with a Magic ID value which is used to uniquely identify the type of data file. Any program written to read these kinds of `beam files' should check that the first line starts with this Magic ID. This way the program can reject non-beam files.

   –  The file contains some keywords that establish the context as to how subsequent groups of data should be read. This allows errors to be detected in the reading process.

•  The reading states we expect are:

   1. Start with expecting to see a Magic ID.
   2. Then expect to see Nbeams specified.
   3. Then expect to see a `beam' keyword.
   4. Then expect to have length, section_area and material specified.
   5. Then expect to see another `beam' keyword etc, or the end of file.

Operations to use

•  We use the basic operations of:

   – Reading lines from files
   – Breaking lines into tokens
   – Processing tokens by string comparison or conversion into values

% readbeamfile - reads textfile specifying beam structures
%
% Usage: beam = readbeamfile(filename)
%
% Returns an array of beam structures each having fields of
% `length', `section_area', and `material'


Example (cont.)

function beam = readbeamfile(filename)

    % Define some values to indicate reading states - this allows one
    % to structure the code logically

    waitingForID     = 0;   % These numbers are arbitrary
    waitingForNbeams = 1;   % - they just have to be unique.
    waitingForBeam   = 2;
    waitingForData   = 3;

    % State transitions can only be as follows:
    %
    % waitingForID -> waitingForNbeams -> waitingForBeam -> waitingForData --
    %                                           ^                           /
    %                                           |__________________________/

    [fid, msg] = fopen(filename, 'rt');
    error(msg);

Example (cont.)

    state = waitingForID;   % Initialise state
    beamNo = 0;             % Initialise beam index
    line = fgets(fid);      % Get first line

    while line ~= -1        % While there is still data to read

        remainder = line;

        while any(remainder)
            [token, remainder] = strtok(remainder);

            if isempty(token)               % No tokens left on this line
                break;                      % break out of while loop and
                                            % go to next line

            elseif strncmp(token, '%', 1)   % This is a comment, skip what is left
                break                       % and go to next line.

            elseif state == waitingForID

Example (cont.)

                if ~strcmpi(token, 'CITS1005BEAM')   % Check for file type
                    fclose(fid);
                    error('This is not a CITS1005BEAM data file');
                end
                state = waitingForNbeams;            % State transition

            elseif state == waitingForNbeams

                if strcmpi(token, 'Nbeams')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    Nbeams = str2num(token);

                    % Allocate memory for struct array
                    beam = struct('length', cell(1,Nbeams), ...
                                  'section_area', cell(1,Nbeams), ...
                                  'material', cell(1,Nbeams));

                    state = waitingForBeam;          % State transition
                else
                    fclose(fid);
                    error('Unexpected data in file');
                end

Example (cont.)

            elseif state == waitingForBeam

                if strcmpi(token, 'beam')
                    beamNo = beamNo + 1;      % Increment beam count
                    state = waitingForData;   % State transition
                else
                    fclose(fid);
                    error('Unexpected data in file');
                end

            elseif state == waitingForData    % Fill in the beam data fields

                if strcmpi(token, 'length')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).length = str2num(token);

                elseif strcmpi(token, 'section_area')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).section_area = str2num(token);

                elseif strcmpi(token, 'material')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).material = token;

                else
                    fclose(fid);
                    error('Incomplete beam data, or unexpected data in file');
                end


                if beamSpecified(beam(beamNo))   % We have all the data
                    state = waitingForBeam;      % Go onto next beam
                else
                    state = waitingForData;      % Keep looking for data for this beam
                end

            end
        end

        line = fgets(fid);   % Get next line
    end

    % Check we got all the beams

    if beamNo ~= Nbeams
        fprintf('Data for %d beams were read, %d were expected\n', beamNo, Nbeams);
    end

    fclose(fid);

% Internal function to check that all fields of a beam structure have been set

function v = beamSpecified(b)
    v = ~isempty(b.length) && ~isempty(b.section_area) && ~isempty(b.material);
