Overview - University of Western Australia · 10/7/2009


Lecture 20: String Processing

Overview
•  Reading lines from text files
•  Processing strings
   – Breaking a string into tokens
   – Comparing strings for equality
   – Converting strings to numerals

Reading lines from text files
•  Matlab's implementation of the fscanf function is of limited use for mixed data:
   –  all the values read have to be stored as data of the same type within the returned array.
•  This limitation somewhat defeats the purpose of being allowed to specify a format string.
•  The alternative approach to reading a text file is
   –  to read each line of the file as one long string,
   –  to break the string into tokens (the words, numbers, or punctuation in the string), and
   –  to process the tokens individually.

Reading lines of text
•  Matlab provides two functions for reading lines of text from a file:

% Read a line from a file *excluding* the end of
% line characters.
>> line = fgetl(fid)

% Read a line from a file *including* the end of
% line characters.
>> line = fgets(fid)

•  If the end of file is encountered, -1 is returned.
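The difference between the two functions can be checked with a small experiment. The sketch below is an illustration only: it writes a hypothetical two-line file to a temporary location (the file name and contents are assumptions, not part of the lecture) and reads it back.

```matlab
% Write a small two-line file to a temporary location (hypothetical data).
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

% Read it back with both functions.
fid = fopen(fname, 'r');
a = fgetl(fid);   % 'first line' with the newline stripped
b = fgets(fid);   % 'second line' with the newline still attached
c = fgetl(fid);   % -1, because the end of file has been reached
fclose(fid);
delete(fname);

fprintf('fgetl returned %d characters\n', length(a));
fprintf('fgets returned %d characters\n', length(b));
```

Because the end-of-file marker is the number -1 rather than a string, a test such as ischar(line) is one convenient way to decide when to stop reading.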


An example
•  For example:

% DISPLAYFILE: A function to display the contents
%              of a file.
%
% Usage: displayFile(filename)
%
% Arguments: filename - The name of the file to
%                       display.
%
% Returns: Nil.
%
% Author: Lecturers
% Date: Continuously Updated

Example (cont.)

function displayFile(filename)

[fid, msg] = fopen(filename, 'rt');
if fid == -1                  % fopen returns -1 if the file cannot be opened.
    error(msg);
end

line = fgets(fid);            % Get the first line from the file.
while ischar(line)            % fgets returns -1 (not a string) at end of file.
    fprintf('%s', line);      % Print the line on the screen.
    line = fgets(fid);        % Get the next line from the file.
end

fclose(fid);

Processing Strings
•  Text: Chapter 6.2.
•  Matlab's string processing functions are also closely modelled on their C equivalents.

Breaking a string into tokens
•  Tokens are bits of a string - for example, the numbers, words, and punctuation.
•  Which “bits” are important depends on the context or application.
•  E.g. if you are parsing an English sentence you may be interested only in the words and not the punctuation (such as full stops):
   •  Given “Life wasn’t meant to be easy. But should it be this hard?”
   •  The tokens of interest are: Life wasn’t meant to be easy But should it be this hard
•  On the other hand, if you are parsing numbers, you may want to keep the full stops:
   •  Given “3.245, 7.683, -24.7”
   •  The desired tokens are: 3.245 7.683 -24.7, not: 3 245 7 683 24 7


Delimiters
•  The characters that are used to break up the text into tokens are called delimiters.
   E.g. if the delimiters are “, ” (comma and space) we get
   3.245 7.683 -24.7
   whereas if they are “, .-” (or all punctuation characters) we get
   3 245 7 683 24 7
•  Tokens are typically delimited by spaces, tabs, or commas.
   E.g. given the string 'A string of 5 tokens + three more', the tokens are
   'A' 'string' 'of' '5' 'tokens' '+' 'three' 'more'
•  Example: reading and writing text files with Excel...

The strtok function
•  The strtok function is used to extract tokens from a string.
•  The general syntax of the strtok function is:

[token, remainder] = strtok(string, delim)

   –  The variable string is the string to be "tokenised".
   –  The function returns the first token delimited by one of the characters in the delim string (the delimiters).
•  If delim is omitted, the tokens are delimited by any white space character by default.
•  The extracted token is stored in the output variable token.
•  After extracting the first token of string, the rest of the string is returned in the separate output variable remainder.
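As a quick illustration of these rules, the sketch below applies strtok to the numeric string used earlier, with comma and space as the delimiters, and then to a string with leading white space. The variable names are chosen for this example only.

```matlab
s = '3.245, 7.683, -24.7';

% First call: leading delimiters (none here) are skipped and the token
% runs up to the next delimiter. The remainder starts AT that delimiter.
[tok1, rest1] = strtok(s, ', ');      % tok1 = '3.245', rest1 = ', 7.683, -24.7'

% Second call on the remainder: the leading ', ' is skipped first.
[tok2, rest2] = strtok(rest1, ', ');  % tok2 = '7.683', rest2 = ', -24.7'

% With no delim argument, white space is used as the delimiter.
[word, rest3] = strtok('  hello world');   % word = 'hello', rest3 = ' world'

disp(tok1); disp(tok2); disp(word);
```

Note that the remainder keeps the delimiter that ended the token, which is why it can be passed straight back to strtok on the next call.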

An example
•  For example, imagine breaking a line of text into a cell array of tokens:

line = 'A string of 5 tokens + three more';   % Initialise or read a
                                              % line from a file.
token = {};
ii = 1;
while any(line)
    [token{ii}, line] = strtok(line);         % Repeatedly apply the
    ii = ii + 1;                              % strtok function.
end

•  The any function returns True if any element of a vector is non-zero.
•  At the end of the while loop, the token cell array contains all the tokens in the string.

>> token
token =
    'A' 'string' 'of' '5' 'tokens' '+' 'three' 'more'
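The same loop can be combined with an explicit delimiter string and a numeric conversion. The following is a minimal sketch, assuming a comma-and-space separated line like the one in the earlier numeric example. The variable values and the isempty guard are additions for this illustration.

```matlab
line = '3.245, 7.683, -24.7';   % example input (assumed for illustration)
values = [];                    % numeric results are collected here
remainder = line;

while any(remainder)
    [token, remainder] = strtok(remainder, ', ');
    if isempty(token)           % only delimiters were left on the line
        break;
    end
    values(end+1) = str2num(token);   % convert the token to a number
end

disp(values);
```

The isempty check guards against a line that ends in delimiters, in which case the last call to strtok returns an empty token.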

Comparing strings for equality
•  Two strings should never be compared for equality using the == operator.
•  The == operator will return an error if its two arguments are not of the same length (unless one of them is a single character).
•  Even if the strings are the same length, the == operator will return an array of boolean values depending on how individual characters match, not a single boolean answer.
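Both problems are easy to reproduce. The sketch below uses made-up example strings. The try/catch block is only there so that the length mismatch can be shown without stopping the script.

```matlab
a = 'steel';
b = 'steal';

% Same length: == compares character by character.
perChar = (a == b);          % [1 1 1 0 1], one result per character

% Different lengths: == raises an error, caught here for illustration.
try
    r = ('wood' == 'steel');
    failed = false;
catch
    failed = true;
end

% strcmp always gives a single answer, whatever the lengths.
same1 = strcmp(a, b);        % false
same2 = strcmp('wood', 'steel');   % false, and no error

disp(perChar);
disp(failed);
```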


String Comparison Functions
•  Strings in Matlab should be compared using the following functions:

>> c = strcmp(str1, str2)       % Returns True if str1 is
                                % identical to str2.
>> c = strcmpi(str1, str2)      % Compares strings ignoring
                                % case differences.
>> c = strncmp(str1, str2, N)   % Compares the first N
                                % characters of the strings.
>> c = strncmpi(str1, str2, N)  % Compares the first N
                                % characters of the strings
                                % ignoring case differences.

•  Note that the corresponding standard C/Java functions differ: they return 0 if the strings are identical, a negative value if the first string is alphabetically less than the second string, and a positive value otherwise.
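A few concrete calls may make the differences clearer. The strings below are invented for this illustration and loosely follow the keywords used later in the lecture.

```matlab
c1 = strcmp('Beam', 'beam');                  % false: the case differs
c2 = strcmpi('Beam', 'beam');                 % true: case is ignored
c3 = strncmp('section_area', 'section', 7);   % true: first 7 characters match
c4 = strncmpi('NBEAMS 3', 'nbeams', 6);       % true: first 6 match, ignoring case

fprintf('%d %d %d %d\n', c1, c2, c3, c4);
```

Unlike the C functions, a result of 1 (true) here means the strings match.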

Converting strings to numerals
•  The function str2num will evaluate a string as a Matlab expression, converting it to a number.
•  For example:

>> x = str2num('21')
x =
    21

>> x = str2num('1+2')
x =
    3

Str2num and eval
•  Note, however, that spaces can be significant.

>> x = str2num('1 +2')   % This is interpreted as
                         % an array containing two
                         % numbers: 1 and +2.
x =
    1 2

•  The eval function is similar, but perhaps more predictable.

>> x = eval('1 +2')
x =
    3

>> x = eval('[1 2]')
x =
    1 2
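Both str2num and eval evaluate the string as an expression, so arbitrary text is executed rather than just converted. When a token is expected to hold a single number, one alternative (not covered in these slides, so treat this as a suggestion rather than the lecture's method) is str2double, which converts one numeric string and returns NaN when the text is not a number. The example strings are assumptions for illustration.

```matlab
x1 = str2double('21');      % 21
x2 = str2double('.5');      % 0.5
x3 = str2double('-24.7');   % -24.7
x4 = str2double('steel');   % NaN: the text is not a number

% isnan gives a simple way to detect a failed conversion.
if isnan(x4)
    disp('The token "steel" is not a number');
end
```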

Summary
•  The operations described in this lecture form the basic building blocks that allow you to process text files:
   – Reading values from files.
   – Reading lines from files.
   – Breaking lines into tokens.
   – Comparing strings.
   – Converting strings to values.


Putting it all together in one example
•  Here is a text datafile that specifies the properties of 3 beams.

CITS1005BEAM   % Magic ID value
%
% This is a datafile to define a series of beams
% File must start with the appropriate Magic value
% Comments are indicated by `%'
% Number of beams is specified by the keyword `Nbeams'
% followed by a value.
%
% Each beam definition is started by the keyword `beam'
% Fields of `length', `section_area' and `material'
% must be specified,
% but can appear in any order

Example (cont.)

Nbeams 3   % Number of beams

beam   % Start of beam definition
length 100
section_area .5
material steel

beam   % Start of beam definition
length 8
section_area .04
material wood

beam   % Start of beam definition
% Oddly formatted, but valid, beam specification
material rubber length 1 section_area .3

Design your solution
•  The structure of this file has a couple of features:
   –  The data file starts with a Magic ID value which is used to uniquely identify the type of data file. Any program written to read these kinds of `beam files' should check that the first line starts with this Magic ID. This way the program can reject non-beam files.
   –  The file contains some keywords that establish the context as to how subsequent groups of data should be read. This allows errors to be detected in the reading process.
•  The reading states we expect are:
   1. Start by expecting to see a Magic ID.
   2. Then expect to see Nbeams specified.
   3. Then expect to see a `beam' keyword.
   4. Then expect to have length, section_area and material specified.
   5. Then expect to see another `beam' keyword etc, or the end of file.

Operations to use
•  We use the basic operations of:
   – Reading lines from files
   – Breaking lines into tokens
   – Processing tokens by string comparison or conversion into values

% readbeamfile - reads textfile specifying beam structures
%
% Usage: beam = readbeamfile(filename)
%
% Returns an array of beam structures each having fields of
% `length', `section_area', and `material'


Example (cont.)

function beam = readbeamfile(filename)

% Define some values to indicate reading states - this allows one
% to structure the code logically

waitingForID     = 0;   % These numbers are arbitrary
waitingForNbeams = 1;   % - they just have to be unique.
waitingForBeam   = 2;
waitingForData   = 3;

% State transitions can only be as follows:
%
% waitingForID -> waitingForNbeams -> waitingForBeam -> waitingForData --
%                                           ^                            /
%                                           |___________________________/

[fid, msg] = fopen(filename, 'rt');
if fid == -1                        % fopen returns -1 if the file cannot be opened
    error(msg);
end

Example (cont.)

state = waitingForID;               % Initialise state
beamNo = 0;                         % Initialise beam index
line = fgets(fid);                  % Get first line
while ischar(line)                  % While there is still data to read
                                    % (fgets returns -1 at end of file)
    remainder = line;

    while any(remainder)
        [token, remainder] = strtok(remainder);

        if isempty(token)           % No tokens left on this line
            break;                  % break out of while loop and
                                    % go to next line

        elseif strncmp(token, '%', 1)   % This is a comment, skip what is left
            break                       % and go to next line.

        elseif state == waitingForID

Example (cont.)

            if ~strcmpi(token, 'CITS1005BEAM')   % Check for file type
                fclose(fid);
                error('This is not a CITS1005BEAM data file');
            end
            state = waitingForNbeams;            % State transition

        elseif state == waitingForNbeams

            if strcmpi(token, 'Nbeams')
                [token, remainder] = strtok(remainder);   % Next token is the value
                Nbeams = str2num(token);

                % Allocate memory for struct array
                beam = struct('length',       cell(1,Nbeams), ...
                              'section_area', cell(1,Nbeams), ...
                              'material',     cell(1,Nbeams));

                state = waitingForBeam;          % State transition
            else
                fclose(fid);
                error('Unexpected data in file');
            end

Example (cont.)

        elseif state == waitingForBeam

            if strcmpi(token, 'beam')
                beamNo = beamNo + 1;             % Increment beam count
                state = waitingForData;          % State transition
            else
                fclose(fid);
                error('Unexpected data in file');
            end

        elseif state == waitingForData           % Fill in the beam data fields

            if strcmpi(token, 'length')
                [token, remainder] = strtok(remainder);   % Next token is the value
                beam(beamNo).length = str2num(token);

            elseif strcmpi(token, 'section_area')
                [token, remainder] = strtok(remainder);   % Next token is the value
                beam(beamNo).section_area = str2num(token);

            elseif strcmpi(token, 'material')
                [token, remainder] = strtok(remainder);   % Next token is the value
                beam(beamNo).material = token;

            else
                fclose(fid);
                error('Incomplete beam data, or unexpected data in file');
            end


            if beamSpecified(beam(beamNo))   % We have all the data
                state = waitingForBeam;      % Go onto next beam
            else
                state = waitingForData;      % Keep looking for data for this beam
            end

        end   % End of the if/elseif chain on the reading state
    end       % End of the loop over the tokens on this line

    line = fgets(fid);   % Get next line
end

% Check we got all the beams

if beamNo ~= Nbeams
    fprintf('Data for %d beams were read, %d were expected\n', beamNo, Nbeams);
end

fclose(fid);

% Internal function to check that all fields of a beam structure have been set

function v = beamSpecified(b)
v = ~isempty(b.length) && ~isempty(b.section_area) && ~isempty(b.material);


