
Segmentation of Music Primitives

K.C. Ng and R.D. Boyle

Division of Artificial Intelligence, School of Computer Studies,

The University of Leeds, Leeds LS2 9JT, United Kingdom

Abstract

In this paper, low-level knowledge-directed pre-processing and segmentation of music scores are presented. We discuss some of the problems that have been overlooked by existing research but have proved to be major obstacles to robust optical music recognisers [1] for entering music into a computer, including sub-segmentation of interconnected primitives and identification of non-straight stave lines, and present solutions to these problems. We conclude that, with knowledge, a significant improvement in low-level segmentation can be achieved.

1 Introduction

Computers are being increasingly used for musical applications, and numerous available music software packages require a machine representation of music to perform their task. Currently, input methods are very time consuming and require some musical knowledge. Optical musical score recognition, especially if able to analyse handwritten scores, would be an interesting and time-saving input technique. This is similar to the job of an 'engraver', who reads a handwritten music score and, with specific knowledge, transforms it into an engraved music score for printing [5].

This visual problem might seem simple, since writing is black on white paper. Unfortunately, many of the symbols are highly interconnected (Figure 1). Only with musical knowledge can the meaningful figures be discerned.

Figure 1: A number of features which are interconnected.

Interpreting the handwritten notation of a composer is even more difficult because sloppy handwriting results in unclosed and ambiguous note heads, stems not attached to note heads, beams looking similar to slurs, with some phrase marks tying a number of stems and joining up all these features.

The overall target process is thus as follows:

BMVC 1992 doi:10.5244/C.6.49


• A score is scanned optically, and the digitised image is fed into the computer as a raster image.

• The computer performs low-level processing: thresholding and deskewing.

• Segmentation: locate and erase the staves, and decompose any composite features until they are recognisable as primitives.

• Recognition: classify each primitive feature, and

• output it in some appropriate representation, of which the standard MIDI file [6] is perhaps the most popular, being understood by most currently available software, although it is not ideal as an internal representation during processing.

In this paper, we concentrate on the issue of low-level primitive segmentation, especially confusion introduced by slurs, ties, phrase marks and beams. This problem has received attention before; in particular, [4] notes that most musical score recognition is amenable to attack by examining vertical and horizontal projection histograms of suitably chosen windows of an image. Our approach is based on this observation, and our results tend to confirm the view that such simply derived observations carry a wealth of information (often sufficient) in this domain.

2 Pre-processing

2.1 Thresholding

The continuous-tone image from the scanner is converted into a binary (black and white) image. For this purpose the Iterative Threshold Selection Method of Ridler and Calvard [10] with Lloyd's modification [8] is used.

Figure 2: An example input grey tone digital image and the thresholded output.

The threshold method works well and the threshold value converges, usually in fewer than eight iterations. It produces clean output with little or no noise (Figure 2). Any noise that does remain is usually just some isolated points and can be identified easily.
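The iterative selection idea can be sketched as follows. This is a minimal illustration of the basic scheme only (the threshold is repeatedly reset to the midpoint of the mean dark and mean light grey levels); Lloyd's modification [8] is not reproduced here. The function names, the convergence tolerance `tol` and the choice of the global mean as the starting value are assumptions made for the example.

```python
def iterative_threshold(pixels, max_iter=50, tol=0.5):
    """Basic iterative threshold selection on a flat list of grey levels."""
    t = sum(pixels) / len(pixels)              # assumed start: global mean
    for _ in range(max_iter):
        low = [p for p in pixels if p <= t]    # presumed ink (dark) pixels
        high = [p for p in pixels if p > t]    # presumed paper (light) pixels
        if not low or not high:
            break                              # image is effectively uniform
        new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        if abs(new_t - t) < tol:               # threshold has settled
            return new_t
        t = new_t
    return t


def binarise(image, t):
    """Map a grey image (list of rows) to 1 = foreground (dark), 0 = background."""
    return [[1 if p <= t else 0 for p in row] for row in image]
```

On a clearly bimodal scan the loop typically settles after only a few passes, which is in line with the convergence behaviour reported above.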

2.2 Skew correction

At this stage, musical information is contained in the black pixels. In music scores, horizontal alignment is an important property and an indispensable clue during recognition, permitting projection techniques to be used to detect feature position. Symbols are aligned on a staff of five stave lines, and results will only be correct if the staves are horizontal at acquisition time. In practice there is always a slight skew (characteristically less than two degrees) when the image is captured. We adopt a modification of the approach used by Martin and Bellissant [9] to find the skew angle.

First, we compute a measure of horizontality over a range of possible skew angles with a fixed step, for example -5° to +5° at 0.1° intervals. The middle column of the image, which is the most likely to cut through any horizontal line, is scanned from the top to the bottom row. When we find a foreground (black) pixel, a line template, computed using Bresenham's discrete line algorithm [2], is placed through it at each possible skew, and the number of foreground pixels which fall on the template line is counted. The count for each angle is accumulated over all rows, after which the angle with the highest count provides the skew angle. Frequently, there is no clear single peak to determine the skew (Figure 3). In this eventuality, a Gaussian blur is applied to the responses with sufficiently high σ to make it unimodal; the resulting mode is then accepted as the skew.

Figure 3: Skew measurement (pixel count against angle of possible skew, with a blurred version of the response).
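A rough sketch of this skew search is given below. It assumes a binary image stored as a list of rows with 1 for foreground, and it rasterises each template line by rounding rather than with Bresenham's algorithm [2], which is a simplification. The function names, the clamped border handling in the blur and the default `sigma` are all assumptions for illustration.

```python
import math


def skew_response(image, angles_deg):
    """For each candidate angle, count foreground pixels lying on template
    lines drawn through every foreground pixel of the middle column."""
    h, w = len(image), len(image[0])
    mid = w // 2
    scores = []
    for a in angles_deg:
        slope = math.tan(math.radians(a))
        count = 0
        for r in range(h):
            if not image[r][mid]:
                continue
            for x in range(w):
                y = int(round(r + (x - mid) * slope))   # rounded line template
                if 0 <= y < h and image[y][x]:
                    count += 1
        scores.append(count)
    return scores


def gaussian_smooth(values, sigma):
    """1-D Gaussian blur of the response curve (ends clamped)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, wgt in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)
            acc += wgt * values[j]
        out.append(acc / total)
    return out


def estimate_skew(image, angles_deg, sigma=2.0):
    """Return the candidate angle at the mode of the blurred response."""
    smoothed = gaussian_smooth(skew_response(image, angles_deg), sigma)
    best = max(range(len(angles_deg)), key=lambda i: smoothed[i])
    return angles_deg[best]
```

The sign convention of the returned angle depends on the image coordinate system (here rows increase downwards), so it should be checked before the result is passed to a rotation routine.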

Figure 4: An input image with skew, and its deskewed version.

The original 256 grey-level image is then rotated, after which the thresholding algorithm is reapplied. This deskewing process is certainly worthwhile (Figure 4). Deskewed images have stave lines of more even and smooth thickness, which lightens the effort of locating and removing them at a later stage. Often, the results are not pixel-perfect, but are at least good enough.

3 Locating and erasing the staves

The stave is the fundamental element of a musical score. A note head by itself may represent the duration of a note, but it has to be associated with a stave line to gain its pitch. A staff is a group of five equally spaced stave lines; the stave line thickness and the distance between two stave lines are important parameters at all later stages of recognition. Once we know the position of these lines, they become distractions when we try to recognise the features which have been engraved on or around the staff. Hence, the stave must first be located and measured, and then erased to isolate the musical features. Unfortunately, the stave lines often pass through musical symbols, and so they must be erased selectively in order not to disconnect these symbols.

A histogram of the horizontal projection is generated, in which groups of five equally spaced peaks are usually clear. If they are not (due, for example, to inter-staff text) this pattern can be recovered by a suitable blurring of this histogram. From this information, the stave line position, average line width and the space between lines are extracted and recorded.
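As an illustration of the projection step, the sketch below sums foreground pixels along each row and treats runs of rows above a cutoff as stave lines. The cutoff rule (a fixed fraction `frac` of the tallest peak) and the function names are assumptions for the example; the blurring fallback mentioned above is not included.

```python
def horizontal_projection(image):
    """Number of foreground pixels in each row of a binary image."""
    return [sum(row) for row in image]


def find_stave_lines(image, frac=0.5):
    """Return (centres, widths, gaps) of the row runs whose projection
    exceeds frac * max. The cutoff rule is an assumed heuristic."""
    proj = horizontal_projection(image)
    cutoff = frac * max(proj)
    runs, start = [], None
    for r, v in enumerate(proj + [0]):         # trailing 0 closes a final run
        if v > cutoff and start is None:
            start = r
        elif v <= cutoff and start is not None:
            runs.append((start, r - 1))
            start = None
    centres = [(a + b) / 2 for a, b in runs]
    widths = [b - a + 1 for a, b in runs]
    gaps = [c2 - c1 for c1, c2 in zip(centres, centres[1:])]
    return centres, widths, gaps
```

For a clean staff this yields five centres with near-equal gaps; averaging `widths` and `gaps` gives the line width and inter-line spacing that later stages rely on.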

Figure 5: Tracing the stave line (figure labels: centre column, centre point, stave lines).

In practice, the stave lines are not found to be completely straight or of even thickness, causing the parameters for each stave line to differ considerably. Thus we have to repeat the process for each stave line.

For each stave line, we start from its centre column at its central (vertical) position and trace it right and left (Figure 5). Assuming that the line is straight and horizontal, for each column the pixel in the indicated vertical position is taken, and the height of the foreground feature in that column of which it is part is recorded. The distribution of these heights has a clear mode, which gives the 'usual' stave line thickness. Since in practice the line thickness is seen to fluctuate, we choose as the highest acceptable width (WL) the point after this mode at which the distribution gradient is almost equal to zero (Figure 6).
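One possible reading of this rule is sketched below: starting at the mode of the measured heights, walk towards larger thicknesses until the histogram stops falling appreciably. The flatness tolerance `flat` (a fraction of all samples) and the function name are assumptions; the text above does not fix how "almost equal to zero" is decided.

```python
from collections import Counter


def max_line_width(heights, flat=0.02):
    """Return (mode, WL) from a list of per-column run heights.

    WL is taken as the first thickness at or after the mode where the drop
    to the next histogram bin is no more than `flat` of all samples
    (an assumed criterion)."""
    counts = Counter(h for h in heights if h > 0)   # ignore empty columns
    total = sum(counts.values())
    mode = max(counts, key=counts.get)
    largest = max(counts)
    wl = mode
    while wl + 1 <= largest and (counts[wl] - counts[wl + 1]) / total > flat:
        wl += 1
    return mode, wl
```

Heights well beyond WL then correspond to columns where a symbol such as a stem or note head overlaps the line.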

Each stave line is then scanned again; commencing from the centre column, the predicted centre pixel is inspected. If it is background, the closest foreground pixel (within the range of 2 * WL) to the predicted position in that column is located. If there is no foreground pixel in that range, we assume that the line may be disconnected and go on to scan the next column. If the height of the connected foreground strip so defined does not exceed WL, its vertical centre is recorded as the best estimate of the line centre position and the strip is deleted. If the strip exceeds WL in height, it is allowed to remain and the centre estimate is not amended. This procedure is then iteratively repeated in neighbouring columns.
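The second pass might be sketched as below, assuming a binary image held as a mutable list of rows and an integer starting estimate of the line centre. The helper names, the in-place erasure and the integer rounding of the centre are choices made for this example only.

```python
def vertical_run(image, x, y):
    """(top, bottom) rows of the foreground run through (x, y) in column x."""
    top = bottom = y
    while top > 0 and image[top - 1][x]:
        top -= 1
    while bottom < len(image) - 1 and image[bottom + 1][x]:
        bottom += 1
    return top, bottom


def erase_stave_line(image, centre_y, wl):
    """Trace one stave line right and left from the centre column, deleting
    strips no taller than wl and leaving taller strips (symbols) intact."""
    h, w = len(image), len(image[0])
    mid = w // 2
    for columns in (range(mid, w), range(mid - 1, -1, -1)):
        y_est = centre_y                        # restart from the centre
        for x in columns:
            y = y_est
            if not image[y][x]:
                # candidates ordered by distance, within 2 * wl of prediction
                hits = [y + d for k in range(1, 2 * wl + 1) for d in (-k, k)
                        if 0 <= y + d < h and image[y + d][x]]
                if not hits:
                    continue                    # gap in the line: next column
                y = hits[0]
            top, bottom = vertical_run(image, x, y)
            if bottom - top + 1 <= wl:
                for r in range(top, bottom + 1):
                    image[r][x] = 0             # thin strip: stave line only
                y_est = (top + bottom) // 2     # amend the centre estimate
            # otherwise a symbol crosses here: keep pixels and old estimate
    return image
```

Because tall strips are left untouched, a stem or note head crossing the line keeps its pixels, at the cost of the residual stave noise discussed in the text.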

Figure 6: Determining the line thickness (histogram of stave line thickness across the image: frequency against thickness in pixels, with WL marked).

The output is good but not perfect (Figure 7). Features that were engraved on the staff inherit some noise from the stave lines when they are removed, while thin and long features such as slurs, ties and phrase marks tend to be disconnected. At this stage, perfect erasing is very unlikely, but these imperfections are not important as they can be overcome during recognition.

Figure 7: An example of stave line removal.

4 Segmentation

The primitives with which the composer deals (crotchets, rests, slurs etc.) are often not as simple for an automatic system to identify as their component parts. For this reason, henceforward we use 'primitives' to mean graphical primitives on the page, which do not always correspond to the musical primitives which the destination representation will expect. In particular, stems will be regarded as features to be recognised independently of the note head to which they are connected, and beams connecting quavers (for example) will likewise be regarded as primitives in their own right. Given accurate recognition of these low-level primitives, a reliable reconstruction of the derived musical symbols should be straightforward.

When the stave lines are erased, the image will be left with blocks of connected foreground pixels which may be recognisable primitives, such as note heads or stems, or composite objects, such as a group of four semi-quavers, or noise or part of a stave. These are inspected sequentially to determine whether they are primitive features or need further segmentation.

4.1 Primitive sub-segmentation

From the object segmentation, if the object is too 'large' as a primitive feature relative to the staff, it must be a composite made up of a number of connected primitive features (Figure 8).

Figure 8: Examples of composite objects.

In practice, the connections are frequently straight lines (beams) or curves (phrase marks, slurs, ties) which cut through the other note stems or connected note heads. Within such composites, a sudden change in the vertical projection histogram usually suggests a possible junction point of two separate features.

In a similar application (separation of merged characters during OCR), Kahan et al. [7] observe that maxima in the absolute value of the second difference of the projection are a good indicator of these positions, and that, since 'break points' may be expected to be thin, the ratio of this difference to the projection height is a better measure still. Consequently, we evaluate the measure $(V(x-1) - 2V(x) + V(x+1)) / V(x)$ across the vertical projection $V(x)$ (Figure 9).
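The measure is simple to compute directly from a binary image. The sketch below follows the formula as written (signed, without the absolute value) and assigns zero wherever the projection is empty or at the two end columns, which is an assumption made to avoid division by zero; the function names are illustrative.

```python
def vertical_projection(image):
    """Number of foreground pixels in each column of a binary image."""
    return [sum(col) for col in zip(*image)]


def break_measure(v):
    """(V(x-1) - 2V(x) + V(x+1)) / V(x) for interior columns with V(x) > 0."""
    m = [0.0] * len(v)
    for x in range(1, len(v) - 1):
        if v[x] > 0:
            m[x] = (v[x - 1] - 2 * v[x] + v[x + 1]) / v[x]
    return m


def best_break(v):
    """Column index at which the measure is largest."""
    m = break_measure(v)
    return max(range(len(m)), key=lambda x: m[x])
```

For a projection such as `[8, 8, 8, 2, 8, 8]` the thin column at index 3 scores (8 - 4 + 8) / 2 = 6, while its neighbours score -0.75, so the division by V(x) strongly favours narrow junctions.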

When the horizontal position (X) with the maximum value of this function is found, we assume that this is a junction point at which the object may be separated into two or more smaller features which are connected by a possibly long and relatively thin feature. Instead of just separating the object into two, we attempt to trace and extract the connecting feature. Starting from X, we trace to its left and right until the image boundary is met, or there is no connected foreground pixel ahead. First, we find the centre position (interpreted as above) of the connector at X and measure its thickness. For the next column to trace, the first guess of the centre is that of the preceding one; if this prediction is background, 8-neighbour connectivity is used to find any immediately connected pixel, as may occur at a sharp turning point on a curve. If the thickness of the column is less than half of the space between two stave lines, the foreground pixels which connect with the centre pixel are marked as an unambiguous part of a connecting feature; otherwise the column must be shared by two closely neighbouring features and is preserved.

Figure 9: A composite feature, its vertical projection and the ratio of second difference to the projection.

The features we are tracing are either linear (in the case of beams) or approximately quadratic (in the case of slurs, phrase marks and ties). After a suitable number of columns (characteristically 10) have been processed, we can, via a least-squares estimate, fit a polynomial $y = a + bx + cx^2$ to the observed centre points, which is used to predict the likely feature position in the next column. This permits a sufficiently accurate prediction of feature position 'through' objects such as stems, within which accurate measurements cannot be made. Figure 10 shows that the connector, in this case a long slur, was identified and that when we separate the slur, other primitives are not disturbed. Figure 11 shows the separation of a phrase mark joined with a note head and an accent sign. Notice that some of the estimated centre points of the segmented phrase mark are not continuous with their neighbours; this is due to the least-squares estimate being insufficiently accurate and falling into background. When this happens, we try to reuse the previous centre position. It is possible that this problem would be solved by a higher order curve approximation.

Figure 10: Connector was identified (thin line).
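A least-squares quadratic fit of this kind can be sketched without external libraries by solving the 3 x 3 normal equations directly. The function names are hypothetical, and the sketch assumes at least three distinct column positions; for long traces it would be prudent to measure x relative to the start of the fitting window to keep the sums well conditioned.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]           # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting on the augmented system.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [u - f * v for u, v in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def predict_centre(xs, ys, x_next):
    """Predicted centre row of the connector in column x_next."""
    a, b, c = fit_quadratic(xs, ys)
    return a + b * x_next + c * x_next ** 2
```

In use, `xs` and `ys` would hold the most recent traced columns and their centre rows (around 10 of them, following the figure quoted above), and the prediction would be rounded to the nearest row before the pixel is inspected.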

Figure 11: The sub-segmentation routine segments out the connector.

This process is repeated until the output sub-segment is a possible primitive feature. The termination criterion is the feature having a density within its bounding box higher than 75%, or being recognisable as a basic primitive such as a note head.
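The density part of this stopping test could look like the sketch below, assuming each connected object is available as a collection of (x, y) foreground coordinates. The function names are illustrative, and the recognisability test for basic primitives is outside the scope of the example.

```python
def bounding_box_density(pixels):
    """Fraction of the object's bounding box that is foreground."""
    pts = set(pixels)                      # guard against duplicate entries
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return len(pts) / area


def is_dense_enough(pixels, min_density=0.75):
    """True when the density exceeds the 75% figure quoted in the text."""
    return bounding_box_density(pixels) > min_density
```

A filled note head or a straight stem fills most of its box and passes, whereas an L-shaped remnant of a beam and stem does not and would be segmented further.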

This works very well for phrase marks, slurs, ties and beams if a good break point can be identified. In practice, noise by the side of a stem or bar line may be indicated as the break point; this happens rarely, and is identifiable from its size (very small relative to stave line thickness) and we may simply continue the process.

For vertically connected features, such as a chord, we try to apply the same technique to the horizontal histogram, but the response is not so clear. By making use of the knowledge of the inter-stave line distance and the estimated break points we can deduce good estimates of the location of note heads; possible break points which fall on a stave line or halfway between two stave lines are likely candidates (Figure 12). A complete and robust segmentation of such connected features is likely to require fuller interaction with the recognition phase, and suitable feedback, and this represents work in hand.
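The geometric filter on candidate break points can be expressed as a small check: a candidate row is kept if it lies within some tolerance of a stave line or of the midpoint between two lines. The function name and the tolerance `tol` (in pixels) are assumptions; the text does not say how close a candidate must be.

```python
def near_staff_position(y, first_line_y, spacing, tol):
    """True if row y is within tol pixels of a stave line or of the point
    halfway between two stave lines."""
    half = spacing / 2.0
    steps = (y - first_line_y) / half      # distance in half-space units
    return abs(steps - round(steps)) * half <= tol
```

With the first stave line at row 100 and a spacing of 10 pixels, rows 105 and 110 are accepted, while row 103 is rejected for a tolerance of one pixel.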

Figure 12: Example input with some estimated break points and its original staff position. The three strong peaks are in 'good' positions.

5 Conclusion

In this paper, we have discussed some potential problems encountered in the early processing of musical scores, and proposed some solutions to them. We have chosen to interpret symbols at the most primitive of levels, and have attempted to segment out such primitives from features which are often highly interconnected, and have demonstrated success at extracting long connecting features such as phrase marks, ties, slurs and beams, leaving the primitives isolated for a subsequent recogniser to work on. Features closely connected in a vertical direction respond to a similar approach, exploiting knowledge of the score geometry. When this approach is combined with a higher-level process providing recognition, we expect to be able to take advantage of the musical syntax [3] to verify the identity of features, or to provide feedback to questionable segmentation, thereby making the whole system very robust and reliable.


References

[1] M Brown, A Clarke and M Thorne. Problems to be faced by developers of computer based automatic music recognisers. In International Computer Music Conference, pages 345-347, 1990.

[2] J E Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25-30, 1965.

[3] H Fahmy and D Blostein. A graph grammar for high-level recognition of music notation. In First International Conference Document Analysis and Recognition, pages 70-78, September 1991. Sep 30 - Oct 2.

[4] I Fujinaga. Optical music recognition using projections. Master's thesis, McGill University, 1988.

[5] W Gamble. Music Engraving and Printing. Da Capo Press Music Reprint Series. Da Capo Press, 1923.

[6] International MIDI Association. Standard Musical Instrument Digital Interface Files 1.0, July 1988.

[7] S Kahan, T Pavlidis, and H S Baird. On the recognition of printed characters of any font and size. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(2):274-288, March 1987.

[8] D E Lloyd. Automatic target classification using moment invariants of image shapes. Technical Report RAE IDN AW126, Farnborough, U.K., December 1985.

[9] P Martin and C Bellissant. Low-level analysis of music drawing images. In First International Conference Document Analysis and Recognition, pages 417-425, September 1991. Sep 30 - Oct 2.

[10] T W Ridler and S Calvard. Picture thresholding using an iterative selection method. IEEE Transactions SMC, 8(8):630-632, August 1978.
