
HUMAN-COMPUTER INTERACTION

Constraints imposed by programme content on design, delivery and use of interactive media

D.R. Clark, PhD

Indexing terms: Human-computer interaction, Image processing, Multimedia systems, Video systems

Abstract: Systems based on digital representations of images are explored and analysed in the light of the sort of demands that interactive programmes of worthwhile complexity will place on them. By separating the processes of design from delivery, important advantages can be secured for these two quite dissimilar processes. The special advantages of analogue representations of certain types of data are highlighted, and the problems of the representation of meaning are discussed in connection with the management and indexing of essential pictorial imagery. The interactive nature of programmes devised around still and moving images poses problems not usually encountered in conventional computing, and the consideration of these problems can best be carried out as an extension of human-computer interaction research.

1 Introduction

Now that there seems to be a commercial future for interactive media systems, the battle is on to secure the technology by means of which interactive programmes are created and delivered. In the rush to gain a foothold, some key aspects of the real nature of interactive media systems have been overlooked. In particular, aspects such as the nature of the platform, the ways in which data are to be stored and delivered, and the language in which such systems are to be constructed, have not been subjected to sufficient scrutiny and analysis from the point of view of the demands of sounds and pictures.

But more importantly, the relation of the content of the programmes to the form of the mechanical systems used to manipulate them has been insufficiently considered. In place of an ab initio study there has been a too rapid acceptance of techniques currently in use in computing, and too little thought has been given to the essential nature and practical utility of sounds and images and the demands that their presentation makes on current computing strategies.

2 Separating design from delivery

The task of designing an interactive presentation is quite different from the task of delivering it to the user. As so few of the interactive programmes created thus far have

© IEE, 1994. Paper 9643B (C3), first received 6th November 1992 and in revised form 19th May 1993. The author is at i-Media, 1 Addington Road, London N4 4RP, United Kingdom, and is a visiting professor at the Faculty of Design, Middlesex University, United Kingdom

IEE Proc.-Comput. Digit. Tech., Vol. 141, No. 2, March 1994

had any clear purpose, apart from exploiting a particular piece of hardware, this dichotomy has not presented itself too clearly. It is in the games market that this position is explicit: Nintendo, Sega and their derivatives are special-purpose delivery engines for programs designed and developed on far more resourceful machines.

There is a subtle interplay between the exact nature of a machine and the things it can do. No matter how you try, you cannot boil an egg in a toaster. Everybody knows that, because we all have a good working knowledge of eggs, water and toasters. Unfortunately, far too few people have a good working knowledge of computers and the things that they can reasonably be expected to do. Even amongst the community of those who should know better, there is a reluctance to deny the possibility that one day (soon) the computer will be able to do it, whatever 'it' is. These people have been overwhelmed by the pace of technological advance and have lost the power to reason. In their world everything is possible, if not now, then with the next generation of computers, or perhaps the one after... Such 'technoptimism' is dangerous, and often goes hand in hand with the idea that there is no real need to understand exactly how something works to decide on its feasibility or suitability.

2.1 Fitness for purpose

It is often forgotten that a machine that is not bad for everything is no good for anything. There seems to be a fundamental law of nature that says that if you optimise a machine for task X, the more a task resembles not-X, the worse the machine will be at accomplishing it. This is why you can't poach eggs in a toaster: the poacher and the toaster are optimised for orthogonal tasks. As the tasks performed by computers cease to be mathematical for their own sake, like numerical integration, and become more exploratory, like image processing, so the question of fitness for purpose becomes more acute. The tasks associated with the design of interactive programs are quite different from the tasks associated with the delivery of those programs once constructed. This is because, once designed, the resulting program is finite: its range of tasks and operations is precisely known. The design process is far less determinate, since a range of alternatives must be maintained in such a way as to allow each to be changed in a number of ways that can subvert the whole of the currently existing logic. That's just a convoluted way of saying that in a design system it must be easy to change your mind.

3 Requirements for interactivity

It is easy to confuse the requirements for design and delivery by saying that they must both exploit interactivity. The interactivity in a design system is real; in the delivery system it is illusory. A good example of this is the maze at Hampton Court. This is a pure delivery system, in which the interaction of the walker within the maze is a dynamic process for the walker, but the maze itself is static and unchanging. There was a time when that maze did not exist; it had to be both created and walked at the same time to test the suitability of the current state of the paths. In fact, its early stages were not a maze at all, since the thick yew hedges had not been planted and the paths were only marks on the ground, with the designer imagining the view as he walked the path. At that stage the range of the possible was wide: the maze could have been a different size or shape, been in a different place, or not existed at all. Now the choice is finite, but no less 'interactive', once the rules are accepted, since the maze exists in its finalised form. The attempts to force holes through the hedge are a testament to the difference between design, where changes of this type were legal, and delivery, where they are not.

3.1 Maintenance of diversity

The central objective in design systems is the capacity to maintain a wide range of options, a subset of which will become the final program. The constituents of these options can now include still images, moving images, graphics and animations, each with sounds of various kinds. It is useful to list the generic forms of these constituents:

3.1.1 Images: The essential feature of an image is that it is a projection of the real world. There is a one-to-one correspondence between each point in the image and a point in space. The image has been created by a mechanical system, the most straightforward of which is a camera. There is information at all points in an image, since none of the registration is accidental; the question of noise is ever present, but that does not alter the essential point that an image is a mapping of the external world.

By convention, images can be either still or moving. This division is, of course, artificial, since moving images are a procession of individual stills. The rate of procession is chosen to exploit the properties of the visual pathway, and the content of 'stills' designed to be perceived as motion has to be carefully controlled to sustain the illusion.

Moving images generate in the viewer the presumption of accompanying sounds. That the sounds need not be imaged in the way that the external world is imaged is an interesting commentary on the different ways that input from the ears and eyes is processed in the brain. It is also an essential degree of freedom in the film-maker's art.

3.1.2 Graphics: It is hard to find a word to cover all those pictures which are not the result of imaging. The essential feature of these pictures is that they have been constructed. There is no necessary connection between the contents of the image plane and any other entity. These pictures could always be otherwise.

The page of text you are reading is in this sense a graphic: it has been drawn. The word 'graphic' now covers pictures that have their origins in images, but have been processed in various ways to modify their appearance. Since this modification destroys the correspondence between the object and its image, these pictures are no longer images.

Following the techniques for creating the illusion of motion by the serial presentation of appropriate stills, graphics can portray a sense of temporal development. These dynamic graphics are referred to as animations. They too can generate the expectation of sounds.

3.1.3 Sounds: Sounds can be grouped in a similar way, based on the degree of necessity or intrinsic information that they contain in relation to the picture. 'Lip-synch' speech and 'spot effects' are the film-maker's names for those sounds that have an essential relation to the image. It is true by definition that graphics can have no necessary sounds in this sense. Any sound panorama to go with a graphic must be constructed, just as the picture has been constructed. This is not to say that the sounds need have no precise relationship to the picture; it is just that the relationship is not a necessary one. It is purely artistic.

3.1.4 Transitions: Since Eisenstein [1] it has been a truism for film-makers that the joins between pictures are as important as the pictures themselves. There is nearly 100 years of experience in what can safely be called the grammar of film. The way that one picture follows another is of profound significance in its effect on the viewer. The cut, the mix, the fade: each has its intrinsic effect on the unfolding narrative. There is a subtle armoury of devices, both in the broad sweep and in the fine detail, which amounts to a vocabulary with which the film-maker creates new worlds in the viewers' minds [2]. The hallmark of today's 'multimedia' systems is that none of them yet supports the manipulation of this vocabulary in the appropriately grammatical ways [3].

3.2 Suspension of disbelief

In the case of the delivery system, the prime task is to support the fundamental illusion that permits the viewer to accept the pictures presented as being worthy of serious consideration. The aim is to make the viewer see through the pictures to perceive the underlying idea. The perfect performance is the one that does not appear to be a performance, or, to quote Gene Kelly, 'If you look like you're working, then you're not working hard enough.' This proposition is at the heart of the difficulties which beset today's interactive systems. They are clearly inadequate, but the tasks they have to accomplish are sufficiently outside their compass that there is no possibility of them 'working harder'. They are condemned to look like amateurs at the end of the pier.

4 Representation, emulation and simulation

The chief task of any system designed to present pictures to the viewer is to support a particular evocation of reality. Without getting into profound philosophical debate on the nature of reality, for this purpose it is sufficient to define 'reality' as that domain which contains information at all scales. This effectively distinguishes it from any representation, since a representation must have, by definition, a practical limit beyond which there ceases to be representative information. For example, once you have magnified a region of a photograph of a flower beyond a certain point, there is no more information about the object, only the inner workings of the image (e.g. the film grain), which has no connection with the flower whatever.

It seems that there are three levels of portrayal: representation, emulation and simulation. They form a hierarchy based on the degree to which we are expected to suspend disbelief (Fig. 1).



Fig. 1 Three levels of portrayal: representation, emulation and simulation. (Figure axes: 'pictures and sounds: realism?' and 'underlying model: accurate?')

4.1 Representation

A representation contains only those features which are relevant to the current task, and the means by which the representation is achieved has no necessary connection with the entity represented. A carefully illustrated diagram of a piece of apparatus might be such a representation.

4.2 Emulation

An emulation presents certain features of an object or system in such a way that although the outward appearance of the target is a representation, the actions or responses it produces relate to the real system in a necessary way. An example might be the depiction by graphical means of the growth and decay of an ecosystem: we see no flowers grow or rabbits die, but the underlying mathematics is accurate. The representation by graphical means is an emulation because, although it doesn't look like an ecosystem, it behaves like one with respect to changes in input or parameters.
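The ecosystem example can be made concrete with a minimal sketch. The model and parameter values below are illustrative assumptions, not taken from the text: a Lotka-Volterra predator-prey system is an emulation in this sense, because the numbers behave like an ecosystem even though nothing presented looks like one.

```python
# Hypothetical emulation sketch: Lotka-Volterra predator-prey dynamics.
# The underlying mathematics is "accurate" (it obeys the model), while the
# presentation (plain numbers) looks nothing like flowers or rabbits.

def step(prey, pred, dt=0.001, a=1.0, b=0.1, c=1.5, d=0.075):
    """One Euler step of dprey/dt = a*prey - b*prey*pred and
    dpred/dt = d*prey*pred - c*pred (all constants illustrative)."""
    new_prey = prey + dt * (a * prey - b * prey * pred)
    new_pred = pred + dt * (d * prey * pred - c * pred)
    return new_prey, new_pred

def emulate(prey=10.0, pred=5.0, steps=20_000):
    # Run the model forward; a design system would plot these numbers
    # graphically, but the behaviour, not the look, carries the meaning.
    for _ in range(steps):
        prey, pred = step(prey, pred)
    return prey, pred

prey, pred = emulate()
```

The point of the sketch is the division of labour: the model responds to changes in input or parameters in a necessary way, while the choice of presentation remains entirely free.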

4.3 Simulation

A simulation attempts the closest approach to the invocation of the real world. Not only is the underlying model accurate, but the presentation of the information and the acquisition of input are as lifelike as possible. A flight simulator of the kind used in the certification of pilots is an example. Ideally, simulations require no suspension of disbelief.

4.4 Virtuality

An unpleasant oxymoron has crept into the vocabulary of computer graphics, aided and abetted by the sillier magazines, namely 'virtual reality'. Like 'unique', 'reality' is unqualifiable: it is either real or it isn't. A far better term, which is already gaining in popularity, is the neologism 'virtuality'. Nowhere is the inadequacy of contemporary computing so clearly exposed as in the attempts at simulation where the input permits almost unconstrained selection of viewpoints into a 3-D dataset. But it is not just a problem of processor speed, as I shall show.

5 Allocation of bandwidth

In our unmediated experience of the world, the limiting factor for data acquisition is our own intrinsic processing speed. The ways in which we humans acquire information are ill understood at present. Two examples reveal some of the problems.

Richard Gregory has often pointed out [4] that if we worked by visual processes alone, driving a car at more than about 25 mile/h would be impossible: the rate of stimulation of the retinal sensors as the scene sweeps across the visual field far exceeds the total rate at which information can be transmitted down the optic nerve. Driving is an imaginative activity that takes place largely in the theatre of the mind. Data acquisition plays a crucial part, but only as subject to the imagination.

The second example is the way in which our visual system works [5]. That the number of light-sensitive elements in the retina is 10 000 times larger than the number of afferent fibres in the optic nerve should give pause for thought; that two of the three frequency-sensitive cone types have almost the same spectral sensitivity, and that the experience of colour does not depend on the wavelength of light emitted from the coloured region in many circumstances [6], should also make you reflect on the complexity of the visual process.

The starting point for mechanical systems designed to mediate reality is ill defined. A common factor in all such systems, however, must be that they present their information in such a way as to be in harmony with the properties of the human perceptual systems.

5.1 Human visual system

The human visual system has two channels, each with different spatial and temporal acuities, which resolve objects into information in a luminance channel and a chrominance channel. These channels are nonlinear with respect to all inputs (as are all biological channels), and an illustration of their threshold contrast performance [7] is shown in Fig. 2. The contours mark the surface of detectable contrast: the ability to resolve the difference between adjacent entities when presented spatially and temporally. Luminance bars (of alternating intensity Imax and Imin) that differ by (Imax - Imin)/Imean = 0.005 in a five-cycle-per-degree grating can be resolved only if they are flashed at about five times a second. The chrominance channel has its region of maximum sensitivity at a similar temporal frequency, but at about 1/10th of the spatial resolution. A useful benchmark is that, under optimal conditions, the limit of resolution of the luminance channel, from which the 'meaning' of an image appears to be extracted, is about 30 cycles per degree. The bar of Fig. 3 has 37 black and white pairs per inch (from the 72 pixel per inch Mac Plus nudged to match the 300 dot/in of the LaserWriter). It should read as a grey block beyond 50 in or so. This will be a useful benchmark in the discussions of resolution and definition.

Fig. 2 Threshold sensitivities of the human visual system, in contours of constant (Imax - Imin)/Imean: a luminance channel; b chrominance channel (axes: temporal frequency; spatial frequency, cycles per degree)

Fig. 3 Bar of 37 black and white pairs per inch, which should read as grey panel beyond 50 inches or so; serves as benchmark in discussions of resolution and definition
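The claim about the Fig. 3 bar can be checked with a little geometry (a sketch; the 37 pairs/inch and 50 in figures are the text's, the trigonometry is standard): at 50 inches, one degree of visual angle subtends about 0.87 in, so the bar presents roughly 32 cycles per degree, just beyond the 30 cycle-per-degree resolution limit at which the pattern should fuse to grey.

```python
import math

def cycles_per_degree(pairs_per_inch, viewing_distance_in):
    # Width subtended by one degree of visual angle at this distance,
    # then the number of black/white cycles of the bar that fit into it.
    inches_per_degree = viewing_distance_in * math.tan(math.radians(1.0))
    return pairs_per_inch * inches_per_degree

cpd = cycles_per_degree(37, 50)   # roughly 32 cycles per degree at 50 in
```

At shorter distances the same bar falls well under the 30 cycle/degree limit, which is why the stripes remain resolvable at normal reading distance.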

5.2 Pictures

It is important to appreciate that the human visual system seems to perform its major interpretative actions on information provided by the luminance channel [8]. The chrominance acuity is about an order of magnitude less than that for brightness, and there is a further subdivision in that the eye is more sensitive to variations in hue than to variations in saturation.

A number of strategies make the best of the economies to be had from the division of signals destined for the eye into luminance and chrominance. They are all based on the phenomenological fact that an acceptable greyscale can be constructed from the following sum of weighted brightnesses from the primary R, G and B signals of a scene:

Y = 0.30R + 0.59G + 0.11B

The R, G and B are primary in the sense that no combination of any two will give the third. There are two other orthogonal components that represent the extra information needed to recover R, G and B from this Y; in video they are called the chrominance components [9] U and V (in the USA and Japan they are chosen slightly differently and are called I and Q). In this representation, hue, the variable used to differentiate between colours, is tan⁻¹(V/U), and saturation, the magnitude of the hue, is (U² + V²)^1/2.

It is the Y-signal that has to be maintained at a resolution that will satisfy the 30 cycle per degree cri- terion at the display. U and V are far less critical, and the key point is that one never sees the R, G and B on the screen: the sensation of colour is derived in an elaborate way in the theatre of the mind, for which much of the data is luminance, some is chrominance but most is imagination.

5.2.1 Images: As the video signal is ubiquitous, one can use it as a baseline in bandwidth calculations for images of external objects. The visible raster of European video is spanned by 441 600 square pixels in a 4 x 3 array. It is a matter of the greatest good fortune that the dynamic range of the eye is satisfied by eight bits of brightness (provided that they are carefully chosen), so that 0.42 Mbyte is required for TV luminance. The decision for chrominance is more difficult; human requirements suggest that 42 Kbyte should be sufficient, and Pioneer have developed a recordable videodisc system on this basis that produces broadcast-quality pictures. An international standard has been agreed that offers considerable programming convenience but is far in excess of what the eye can see. This standard, CCIR 601 [10], holds the chrominance at half the resolution of the luminance in the byte sequence U1, Y1, V1, Y2. The half resolution comes from the fact that although each of the chrominance samples is eight bits deep, they are taken half as often and repeated for successive luminance samples. These considerations mean that the frame requirement is 0.84 Mbyte. This is considerably less than the 1.29 Mbyte that storing RGB at eight bits would require.
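The CCIR 601 sharing of chrominance can be sketched as a simple interleave (illustrative only; a real 601 stream has defined signal levels and timing that this toy ignores): each pair of successive luminance samples shares one pair of chrominance samples, giving four bytes per two pixels instead of six.

```python
def pack_422(luma, cb, cr):
    """Interleave as U0 Y0 V0 Y1, U2 Y2 V2 Y3, ...: one chrominance pair
    (here labelled cb/cr for U/V) shared by two successive luminance
    samples, i.e. 4:2:2 half-resolution chrominance."""
    assert len(luma) % 2 == 0 and len(cb) == len(cr) == len(luma) // 2
    out = []
    for i in range(0, len(luma), 2):
        out += [cb[i // 2], luma[i], cr[i // 2], luma[i + 1]]
    return out

# Four pixels: full-rate luminance, half-rate chrominance.
packed = pack_422([16, 17, 18, 19], [128, 130], [129, 131])
```

The packed stream averages two bytes per pixel, which is where the 0.84 Mbyte frame figure (rather than the three-byte-per-pixel RGB figure) comes from.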

The frame rate of video is 25 pictures per second. This is dangerously close to the visual flicker threshold for bright pictures. This is the greatest limitation of conventional video, and the device of interlace scanning of half the image at a time now causes more problems than it solves. This caveat notwithstanding, the digital equivalent of analogue video is 22 Mbyte/s. This is a huge rate in current desktop data management terms, but the gain is that real-world events can be shown without appreciable impairment if it can be achieved. Special-purpose hardware designed for the broadcasting industry is just now beginning to approach these data rates in more than burst mode.
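The arithmetic behind these figures can be reproduced directly from the pixel count and frame rate quoted in the text (per-frame figures come out in 2^20-byte megabytes, the per-second rate in decimal megabytes, which is how the quoted numbers fall out):

```python
PIXELS = 441_600          # visible European raster quoted in the text
FRAME_LUMA = PIXELS       # 8 bits of luminance per pixel: ~0.42 Mbyte
FRAME_422 = 2 * PIXELS    # CCIR 601 4:2:2: 2 bytes per pixel on average
FRAME_RGB = 3 * PIXELS    # separate 8-bit R, G and B: ~1.3 Mbyte
FPS = 25                  # European frame rate

frame_mbyte = FRAME_422 / 2**20          # ~0.84 Mbyte per 4:2:2 frame
rate_mbyte_s = FRAME_422 * FPS / 1e6     # ~22 Mbyte/s full-motion video
```

The same two constants, frame size and frame rate, fix every storage and delivery budget that follows.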

5.2.2 Graphics: As they have no necessary connection with the real world, graphics can be approached from a slightly different point of view. As the palette of colours is a matter of choice, the question becomes: how many bits are required to provide a sufficient range? Many studies [11] over the years have shown that an appropriately loaded eight-bit lookup table provides a sufficient range, and that for a great many applications 16 carefully chosen colours are sufficient. These considerations suggest that 0.22 Mbyte would be sufficient if fast nybble processing was on offer. Even so, the 5.5 Mbyte/s data rate still overwhelms today's desktop engines, so animation is still something of a problem without specially designed hardware.
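The graphics figures follow the same arithmetic: 16 carefully chosen colours need only one nybble (four bits) of lookup-table index per pixel.

```python
PIXELS = 441_600                 # same raster as the video calculation
BITS_PER_PIXEL = 4               # 16 palette entries = one nybble per pixel
FPS = 25

frame_bytes = PIXELS * BITS_PER_PIXEL // 8    # ~0.22 Mbyte per frame
rate_mbyte_s = frame_bytes * FPS / 1e6        # ~5.5 Mbyte/s for animation
```

Even at a quarter of the full video rate, the sustained 5.5 Mbyte/s the sketch arrives at is the figure that, as the text notes, still overwhelms the desktop machines of the day.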

5.3 Sounds

Sound is always the poor relation where pictures are concerned (nothing has changed since 'Singin' in the Rain'). This may be because, in the normal way of things, eyes take precedence over ears in the human perceptual hierarchy (people who need glasses hear better with them on). A more mechanical reason is that the time-serial data rate for images derived above is considerably larger than that required for sound, so sound is sacrificed in the struggle for moving pictures.

The same logic applied to hearing as was applied to the visual system shows that to fool the ear into thinking that the sounds are images of real-world events requires that the pressure wave at the ear be free from spurious amplitude variations to better than 0.1% and contain frequencies which sustain a good impulse response up to about 4 kHz (the upper limit of musical pitch) [12]. This frequency range requires a bandwidth of 20 kHz to accommodate the harmonics necessary for the impulse response. If the signal is to be sampled such that on reconstitution these high frequencies are recovered, the Nyquist condition requires sampling at 40 kHz, and practical considerations in filter design have led to sampling frequencies close to 44 kHz being adopted for CD-DA and DAT (they are, of course, different). To resolve the amplitude to better than 0.1% requires sampling into 2^10 (1024) levels. As this is no useful combination of bytes and nybbles, the convention is to use 16 bits and damn the expense. To provide stereo sound there is a second identical channel, so that the data rate for a CD digital version of an accurate representation of the real sound world is 176.4 Kbyte/s.

It is worth emphasising that the digital audio data rate is smaller by a factor of 125 than the rate for full-screen full-motion digital video.
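The audio budget, and the factor of 125, can be checked in a few lines (using the CD-DA sampling rate and the video rate quoted in the text):

```python
SAMPLE_RATE = 44_100      # Hz, as adopted for CD-DA
BYTES_PER_SAMPLE = 2      # 16 bits, resolving amplitude to better than 0.1%
CHANNELS = 2              # stereo: a second identical channel

audio_rate = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS   # 176 400 byte/s

# Full-screen full-motion digital video rate quoted in the text:
video_rate = 441_600 * 2 * 25                            # ~22 Mbyte/s

ratio = video_rate / audio_rate     # video outpaces audio by ~125 to 1
```

This disparity is why, when channel capacity runs short, it is the sound that gets sacrificed rather than the pictures.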

5.3.1 Synchronised sound: The film and TV industries have always kept sound and picture separate. They have always used parallel channels, and modern tape formats have as many as eight separate audio tracks for each picture track. In these cases, synchronisation is a spatial rather than a temporal problem, since the signals are in separate channels. Once sound and picture signals have to be fitted into the same channel (for broadcasting, for example), a choice must be made between the two possible ways of multiplexing them: frequency division (FDM) and time division (TDM). The virtue of FDM is that all signals are co-temporal, so guaranteeing synchronisation, but intersignal interference is a problem. TDM is freer from intersignal interference but, by definition, it destroys the temporal relationship between sound and picture. Re-establishing the synchronisation requires an extra level of management that is often the last straw on an already overloaded processor.
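The TDM bookkeeping can be sketched with a toy model (not any real broadcast format): sound and picture chunks are tagged with their presentation times before being interleaved into the single channel, and the receiver must reorder by timestamp to re-establish the synchronisation that the multiplexing destroyed.

```python
# Toy time-division multiplex: tag each chunk with its presentation time,
# send everything serially down one channel, then restore synchronisation.

def tdm_mux(video_chunks, audio_chunks):
    # One shared channel: chunks travel serially, so the temporal
    # relationship between sound and picture is lost in transit.
    return [('V', t, v) for t, v in video_chunks] + \
           [('A', t, a) for t, a in audio_chunks]

def resync(stream):
    # The receiver's extra level of management: order by timestamp.
    return sorted(stream, key=lambda chunk: chunk[1])

video = [(0, 'frame0'), (40, 'frame1')]            # 25 frame/s: 40 ms apart
audio = [(0, 'pcm0'), (20, 'pcm1'), (40, 'pcm2')]  # audio blocks
synced = resync(tdm_mux(video, audio))
```

The sort is trivial here; the cost in a real system is that the timestamps, the buffering and the reordering all consume processor time at exactly the moments when the display is most demanding.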

5.3.2 Wild sound: Sounds which do not have distinct objective correlatives, like the hum of an invisible turbine or the unseen seagull’s cry, are called ‘wild’ sounds. Unfortunately, even when the sound has no precise coherence with events in the picture it still has to start and stop at defined moments, usually in relation to cuts to new scenes. Since these times are by definition the busiest times for the processor, sound synchronisation from a TDM source is again a potential problem.

5.4 Management

In interactive systems the interplay between program and data is all-important. This is because neither the program nor the data for the next step can be known in advance. This means that, in general, there will be bottlenecks where both code and data must be retrieved from the storage system. The usual tricks that have been developed to meet this type of situation, which rely on segmentation, do not work well when an image must be delivered at a rate dictated by the display device, which has its own synchrony independent of the processor clock, and the image occupies comparatively large areas of the RAM. Add to this the need to retrieve, synchronise and reproduce the sound, at the same time as managing input devices capable of requesting the instantaneous abort of the current task and either the re-establishment of the prior context or the creation of an entirely new context based on code and data held on remote pages of the storage, and you begin to understand the difficulties faced by the designers of interactive picture systems, and the reasons why Apple and Pioneer have broken the CD standard by spinning CD-ROM discs faster.

5.4.1 Logic: The management of all the aspects of interactive design imposes tight constraints on the programming language to be used to express all the necessary activities. This is an area of active development, particularly from the stance of object-oriented programming [13]. Again, there are profound differences between design and delivery engines, and the problems reside largely with design systems. The underlying difficulty is that, as yet, there is no way in which one can 'parse' images to extract their essential elements. Because of this lack of a logical structure it is impossible to build the high-level image manipulation tools that correspond to the high-level abstract structures of conventional programming.

5.4.2 Input devices: Keeping the system live to input while respecting the requirements of the output devices puts extreme demands on current systems, and it can only get worse as the input and output devices become more sophisticated.

Some form of point-and-click is de rigueur in interactive systems. The keyboard is no longer considered acceptable, and voice input is becoming fashionable. Speech recognition has taken great strides recently, to the point where a small vocabulary can be recognised independent of reasonable deviations from RP speech. This task certainly needs to be off-loaded from a central processor if it is not to impede the rapid flow of programming; this is an achievable goal. Far less easy is the extension of point-and-click strategies from graphics to images [14].

IEE Proc.-Comput. Digit. Tech., Vol. 141, No. 2, March 1994

Some devices conventionally regarded as input systems should be considered in a different manner in the context of interactive systems. The high bandwidth requirements of images and sounds in comparison with those occurring in conventional data processing have encouraged manufacturers to make devices such as LaserDisc players free-standing systems. They are not input devices in the conventional sense, since their output does not require further processing by a CPU. In this sense CD-ROM is an input device, whereas CD-DA is a free-standing system.

5.4.3 Combining images and graphics for output: When picture devices are incorporated into conventional computer systems, provision must be made to combine the image information they can supply with the graphical information that can be generated by the processor. This is a task that is more demanding than any other activity that the processor has to deliver, because of the high data rates required by images, and two distinct strategies have been adopted by computer manufacturers and third-party board suppliers to achieve it.

The two methods differ in the target engine for display. Either the graphical information generated by the processor can be delivered as video (G in V, see Section 5.7), which is an interlaced signal for a conventional display, or the video delivered by the drive can be converted to the noninterlaced display format of the processor's display subsystem (V in G, see Section 5.7). In Fig. 4 the 'pictures' are either video or proprietary format depending on the boards in the processor ('overlay' or 'digitiser' cards, respectively).

Fig. 4  In combining images with graphics, graphical information from processor can be delivered as video, or video delivered by drive can be converted to display format of processor's display subsystem [diagram: drive, processor and display connected, with a control path]

A more promising system is to incorporate in the disc drive such extra processing as is needed to combine the graphical information from the processor with the image information from the drive, Fig. 5. As the drive is already optimised for the data rates appropriate to the demands of images, it is easier to add a graphics capability to the drive than to attempt to equip the processor, which is already marginally capable of handling graphics, with the additional requirements of images [3].

Fig. 5  It is easier to add graphics capability to disc drive than to add image capability to processor

5.4.4 Output devices: There is considerable confusion in the CG community about ‘video’. Video is a carefully defined display format which has the crucial property that the picture plane is scanned twice to generate the full image, the two scanning rasters each covering the plane at full horizontal resolution but providing either the odd or the even lines of the full raster. This is referred to as interlaced scanning.
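The interlace idea can be shown with a toy sketch in Python. This is purely an illustration; real rasters are analogue scanning signals, not lists of strings.

```python
# Interlaced scanning: the picture plane is covered twice, each field at
# full horizontal resolution but carrying only alternate lines of the raster.
frame = [f"line {i}" for i in range(6)]

odd_field = frame[0::2]    # first pass: lines 0, 2, 4
even_field = frame[1::2]   # second pass: lines 1, 3, 5

# Interleaving the two fields re-creates the full raster.
rebuilt = [line for pair in zip(odd_field, even_field) for line in pair]
assert rebuilt == frame
```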

Very few display processors produce CCIR-standard video. Instead, they have systems that cover the picture plane sequentially, in a process referred to as progressive scanning. Neither the number of lines nor the rate at which the pictures are delivered is standardised, manufacturers choosing to create their own 'high-resolution' systems.

All display systems to date in commercial production are analogue devices, so that the final stage before display of digital data is always D-A conversion. As all camera systems are also analogue (even CCD cameras), digits are always an intermediate stage in picture-making systems which can handle real-world data. In the case of graphics, the initial data may be purely digital, from, for example, Paintbox systems.

5.5 Need for 'realistic' pictures

The decision to use a particular level of portrayal is not always straightforward. An image of the real world is always essential if the intention of studying the image is to gain further information. In this case any degradation must be prejudicial. If, however, the intention of the picture is to convey information, that is to say it is an abstraction and construction, then the fidelity of the presentation is defined by known requirements and can be set accordingly.

As we interpret the world for ourselves from essentially temporal information, there is always a need to manage moving pictures. The fidelity of these sequences has a bearing on the way in which we as viewers respond to them. Again, a temporal sequence may seek to be an image of the external world, in which case any artifacts of the coding or display serve to detain the attention on the visual plane itself, thereby detracting from the illusion.

5.6 Need for parallel channels

The underlying point of the argument presented so far is the big difference between the scale of operations involving images on one hand and 'ordinary' computation on the other. This point must be taken even more seriously when the practical utility of imaging systems is taken into account. Two factors are germane here. First, a single image is never enough: the whole point of looking is to compare one image with another. Secondly, those two images will have been selected from a (large) pool of images that were not appropriate, but which nevertheless must have been looked at to accept or reject each of them. It follows from these points that it is not sufficient to judge a picture system on the way it handles a single image; for it to be useful it must be able to manage individual images from a bank at rates that compare with human decision times. This is a formidable task, as humans are highly specialised for reaching conclusions on incomplete data: the corner of an eyebrow is sufficient to invoke for us the idea of the whole face. The design of the HCI system which facilitates this intelligent scanning of a large databank is one of the major problems of contemporary computing.

One of the easiest ways to ensure that performance is not degraded by competing activities is to give each distinct data type its own channel. A corollary of this is that the way in which each data type is held should suit its purpose best. We are aware in graphical constructions that there are several ways of representing graphical entities, each of which has its strengths and weaknesses. Bit maps are versatile and amenable to computer manipulations like BITBLT, but do not perform well under scaling; PostScript is good for scaling but uneconomical for large but irregular area descriptions. Similarly, if no subsequent processing is required, analogue storage is very convenient and economical for images, as will become clear later. Digits are essential for data that is essentially digital, such as machine code. The trick is to pick the most appropriate form for data, not force every type into digits to meet the needs of a processor that may never be involved in manipulating them. Of necessity, these different types will occupy different channels.
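The contrast between sampled and described graphics can be made concrete with a small Python sketch. The two-element 'bit map' and the corner-to-corner 'line' are illustrative stand-ins, not real data formats.

```python
# A bit map scales by replicating samples: each pixel becomes a block.
bitmap = [[0, 1],
          [1, 0]]
scaled = [[bitmap[y // 2][x // 2] for x in range(4)] for y in range(4)]
assert scaled[0] == [0, 0, 1, 1]          # staircase blocks, not extra detail

# A description (the PostScript approach) is re-rendered, not resampled:
def render_diagonal(size):
    """'Line from corner to corner', rasterised at any requested size."""
    return [(i, i) for i in range(size)]

assert len(render_diagonal(4)) == 4
assert len(render_diagonal(400)) == 400   # same description, finer raster
```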

5.7 Taxonomy of modes of delivery

It is useful to arrange the various delivery systems in a hierarchy based on degrees of competence for the various tasks that media systems have to perform, Fig. 6.

Fig. 6  Various delivery systems in hierarchy based on degrees of competence for tasks media systems have to perform [tree diagram rooted at 'pictures and sounds', with leaves including CD-ROM]

Reading from left to right, the leaves on this tree are: (irrelevant); CD-ROM, the data version of CD-DA; Laservision; Compact Disc-Interactive (CD-I, the Philips proprietary system to use CD-ROM with an interleaved data structure designed for a 68000-based engine with a real-time operating system); a PC containing extra display boards and incorporating a Laservision system as input; a PC containing extra boards and combining its output as video with a videodisc player; novel free-standing systems using both the analogue video and digital data channels on LaserDisc with a fit-for-purpose controller.

This taxonomy has been developed from the point of view of the end-user's need for moving pictures of the real world: moving images. The fourth bifurcation in this direction, video in graphics (V in G) or graphics in video (G in V), leads to the two display modes referred to in Section 5.4.3; there is no longer a need for a personal computer in the system. At present, the only realistic way of delivering full screen full motion (FSFM) images with the versatility demanded by interactive programming is analogue video, in this case LaserDisc or the Sony and Pioneer TDM video systems.



6 Compression

There has been a great deal of effort expended recently to bring down the intrinsic data rate for images to a value closer to the rate sustainable by contemporary desktop computing. A reasonable transfer rate from a typical hard disc is now about 2 Mbyte/s. This is still more than a factor of 10 too slow for CCIR 601 digital video. When the speed of the channel is taken into account, the sustainable rate drops by half an order of magnitude, and if the delivery system is CD-ROM, the maximum data rate is 150 Kbyte/s. (The difference between this rate and the full 176 Kbyte/s of CD-DA is the overhead of a block structure that allows highly nonsequential access of a kind that never occurs with listening to music and the need to adjust the rotation rate of the platter.)
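The factor of 10 quoted above can be checked with a few lines of Python, assuming 4:2:2 sampling of a 720 x 576 active picture at 25 frames/s, which is one common reading of the CCIR 601 figures; the exact active area is an assumption here, not a quotation from the paper.

```python
# Back-of-envelope check of the rates quoted in the text.
width, height, fps = 720, 576, 25
bytes_per_pixel = 2                     # 4:2:2 -> 8-bit Y plus alternating Cb/Cr

ccir601 = width * height * bytes_per_pixel * fps   # bytes per second
hard_disc = 2_000_000                              # ~2 Mbyte/s transfer
cd_rom = 150_000                                   # 150 Kbyte/s

print(ccir601)              # 20736000 -> about 20.7 Mbyte/s
print(ccir601 / hard_disc)  # about 10.4: more than a factor of 10 too slow
print(ccir601 / cd_rom)     # CD-ROM is worse by a further order of magnitude
```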

The only way to reduce the data rate to currently sus- tainable levels is to compress the data for each picture into a manageable chunk. There are severe limitations on the degree to which this compression can be imposed.

6.1 What permits compression

There is a major division between the possible types of compression: those whose results can be undone, and those that cannot. Lossless compression can be achieved by removing redundancy or by efficient coding strategies. Lossy compression can be achieved by throwing away information.

The best that can be achieved in lossless compression is about a factor of five. This is a reflection on the efficiency of the representation of the picture by equally-spaced sample bytes. If, a priori, some values are more common than others, then a more efficient coding can be constructed. Even here, the possibility of compression depends on the statistics of the picture; this is only to be expected from an information-theoretic point of view, in which the information content is defined in terms of the (logarithm of its) statistical improbability [15]. The only hope of reducing the digital picture data rate to desktop computer proportions is by throwing some of the information in the picture away. This is not always an acceptable thing to do.
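The information-theoretic bound is easy to compute. The Python sketch below (the sample data are invented for illustration) gives the Shannon entropy in bits per symbol: a picture whose sample values are all equally likely admits no lossless gain at all, while a nearly uniform area can be coded far more economically.

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(data):
    """Shannon entropy: the lossless-coding lower bound in bits/symbol."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

flat = bytes(range(256)) * 4             # every byte value equally likely
sky = bytes([200] * 900 + [201] * 100)   # an almost uniform picture area

print(entropy_bits_per_symbol(flat))     # 8.0 -> no lossless gain possible
print(entropy_bits_per_symbol(sky))      # ~0.47 -> large gain possible
```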

6.2 Compression and meaning

Just as the early development of the video system attempted to make the best of the properties of the human visual system, so the new generation of compression systems seek to take such advantage as they can from an understanding of the human perceptual and conceptual systems. Compression systems based on pixel space, such as digital video interactive (DVI), cannot make use of these strategies, since there are no theoretical correlations between pixels, and only statistical methods can be used to decrease the information content. Compression systems based on folded or iterated spaces offer better prospects for intelligent compression, since these spaces appear to contain the same type of information as that required by the human processing system. An interesting example of the folded space representation is the Fourier domain.

6.2.1 Frequency spaces: The essential point to grasp about Fourier space applied to images is that each point in Fourier space contains contributions from every point in pixel space. A lens is a Fourier transformer: the condition of 'focus' is that in which each point in the object plane is transformed to a unique point in the image plane; as the lens 'goes out of focus', each point in the object plane contributes to every point in the image plane. Properly handled, this seemingly muddled image (it is the real part of the Fourier transform of the object plane) has many useful properties; indeed, it may be on a version of this image structure that our visual system really works.

Imagine two digitised images, the second being related to the first by the fact that the camera taking it has moved forward a little. In general, although the 'scene' is the same in both pictures, no pixel in the second image will be the same as its corresponding pixel in the first image. To manipulate images usefully one must find representations that preserve useful similarities, because it is these similarities which underlie our visual responses to the images. Fourier space provides one such space in which aspects of 'thingness' are preserved.

Just as a one-dimensional time series of values can be represented to an arbitrary accuracy by the sum of appropriately chosen harmonic frequencies, so the two-dimensional block of pixel samples can be represented by the appropriately chosen sum of two-dimensional spatial frequencies. Instead of an array of pixels {P(i, j), i, j = 1, n}, this same data can be represented by the array {F(u, v), u, v = 1, n} of spatial harmonics. The fundamental difference between the two arrays is that whereas the values of successive points in P have no mathematical connection, the values in the array F are carefully ordered as to spatial frequency. The significance of this is that whereas there can be no simple rule for discarding pixels in the P array to achieve a particular effect on elements in the scene, the consequences of discarding elements in the F array are easy to predict. For example, consider an image in which the brightness is corrupted by an interference pattern from some external signal (a not unknown circumstance when using video recording in hostile environments). Every single pixel in the image is modified to some extent by this interference. However, on transforming to Fourier space, this interference, if it has precise frequencies, contributes to only a few of the spatial frequencies in the whole image. These few frequencies stand out from the whole as anomalous and, if they are replaced by 'reasonable' values deduced from their neighbours, and the modified F array transformed back to P, the interference has been removed at a stroke from the whole image.
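This interference-removal argument can be demonstrated numerically. The Python sketch below is illustrative only: it uses NumPy's FFT on a synthetic image, and it simply zeroes the anomalous coefficients rather than interpolating 'reasonable' values from their neighbours.

```python
import numpy as np

# Synthetic 64 x 64 'scene': random detail around a mean brightness.
rng = np.random.default_rng(0)
image = rng.uniform(100.0, 101.0, size=(64, 64))

# Interference with one precise spatial frequency corrupts EVERY pixel.
x = np.arange(64)
interference = 10.0 * np.sin(2 * np.pi * 8 * x / 64)
corrupted = image + interference[None, :]

# In Fourier space the same interference occupies only two coefficients.
F = np.fft.fft2(corrupted)
spectrum = np.abs(F)
anomalous = spectrum > 100 * np.median(spectrum)
anomalous[0, 0] = False      # never discard the mean brightness
F[anomalous] = 0             # simplification: zero, rather than values
                             # interpolated from neighbouring coefficients
restored = np.fft.ifft2(F).real

print(np.abs(corrupted - image).max())   # ~10: every pixel was disturbed
print(np.abs(restored - image).max())    # tiny: removed at a stroke
```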

The JPEG compression scheme is based on the idea that by discarding high spatial frequency components of a picture, thereby reducing the amount of information necessary to describe it, the reproduction of the decompressed image will be only marginally impaired, particularly if the target display device is incapable of reproducing the original high frequencies anyway. As with all lossy companding systems, the more familiar one is with the content of the uncorrupted image, the less satisfactory the degraded reproduction becomes. As always, the easiest feature of humans to exploit is their ignorance.
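The effect of discarding high spatial frequencies can be seen on a single 8 x 8 block. The Python sketch below builds an orthonormal DCT-II matrix by hand (the transform JPEG uses); the smooth test block and the rule for which coefficients to keep are illustrative assumptions, and real JPEG quantises coefficients rather than simply zeroing them.

```python
import numpy as np

N = 8
k = np.arange(N)
# Orthonormal DCT-II matrix: C @ C.T is the identity, so inverse == transpose.
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0] /= np.sqrt(2.0)

# A smooth 8 x 8 block: a gentle vertical brightness ramp.
block = 100.0 + 2.5 * np.outer(k, np.ones(N))

coeffs = C @ block @ C.T                  # forward 2-D DCT
keep = np.add.outer(k, k) < 4             # retain 10 of 64 coefficients
truncated = np.where(keep, coeffs, 0.0)   # discard high spatial frequencies
restored = C.T @ truncated @ C            # inverse 2-D DCT

print(np.abs(restored - block).max())     # well under one grey level
```

For smooth content the reconstruction is only marginally impaired, exactly as the scheme intends; sharp edges would concentrate energy in the discarded coefficients and suffer visibly.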

The consequences of lossy compression become more acute if motion is attempted. The MPEG proposals rely on interpolation between key frames to reduce the data rate. The most undesirable consequence of this interpolation strategy is that the only parts of interest in a moving picture, namely the things that are moving, are rendered in the reconstituted image by pixel data that, point by point, has no necessary connection with the original object. This is the computer version of impressionism, which is good for moving wallpaper but no good for brain scans.



6.3 Analogue video as ideal compression system

Once the constraint of complete fidelity is abandoned, the central justification for the use of digits disappears. There is no doubt that there are aspects of picture processing where the use of digits is beneficial. The most important of these is the situation where the pictures have to go through a number of record/replay cycles. Immunity to noise pile-up and other picture-dependent degradations is best achieved by holding the image in digital form; there is, of course, a price to be paid, in this case the deleterious effects of sampling, quantisation and reconstitution. The whole problem of antialiasing, the strategies to minimise these deleterious effects, has been extensively studied [11] and the contemporary solutions all involve bandlimiting in the Fourier domain.

Now that read/write video disc systems are generally available, it is interesting to consider their role as mass-storage devices for picture data. As these systems have been designed to store and deliver FSFM picture and synchronous sound in at least two independent channels, they already have most of the advantages not possessed by current desktop digital systems. They are, of course, lossy compression systems, in that they are bandlimited and subject to noise. However, these constraints are not serious when compared with the kinds of picture degradation associated with the current digital lossy compression schemes.

Modern systems use a TDM time-compressed component format. The translation to analogue display formats (RGB, composite or S-VHS) from the on-disc recording format can be achieved without significant extra loss over and above that defined by the format. The advent of fast, accurate D/A and A/D convertors makes it feasible to use these analogue systems as fast random-access devices for large amounts of picture data. The digital equivalent of such a storage system requires at least 10 Mbyte/s to produce formally equivalent results. All this is achieved for full screen full motion images with no interpolation, access times of the order of tenths of seconds and zero extraction latency.

The conventional LaserDisc format, which at present holds the picture information as a PCM encoding of an FM carrier by the composite video signal, can hold, as a separate signal in the same channel, the CD digital data format for either digital audio (CD-DA) or digital data (CD-ROM). This means that, in addition to 55500 still full-colour images at TV resolution, such a disc can hold 330 Mbyte of pure digital data. The possibilities for combinations of picture and data that such systems offer are most attractive.

7 Conclusion

There is no doubt that the idea of the desktop delivery of pictures and sounds has become part of the collective consciousness of the computing community. It has also become common currency in the education and business communities, but here the understanding of what is feasible is not as secure as it should be for good decision making. The problems posed by the requirements to present sounds and images in an interactive sequence are extreme, largely because of the need to make carefully managed transitions between sequences which cannot always be forecast. The exact ways in which these transitions are managed is an interesting problem in human factors research, and the devising of appropriate interfaces which permit both the design and delivery of interactive programmes needs further refinement in the light of what is known from the film and television industries about the craft of narrative construction.

In view of the extreme difficulty in delivering full screen full motion images in the desktop digital universe, some thought should be given to the exploitation of the new generation of analogue videodisc recording systems for the storage of large numbers of images at high quality and with sufficiently fast access to deliver FSFM with ease. As these systems also support a full CD digital data channel, they are very good candidates for the basis for integrated delivery platforms for interactive programmes.

As it is the acceptability in use of any system that conditions its success, and, in this acceptability, it is the HCI that makes the greatest contribution to the perceived utility, it is important to avoid technical solutions which make good interfaces difficult to implement. At present, the demands of interactive systems which require extensive use of accurate visual information are difficult to support by purely digital means.

8 References

1 PUDOVKIN, V.I.: 'Film technique' (George Newnes, London, 1933), 3rd edn.
2 DAVIS, D.: 'The grammar of TV production' (Barrie and Jenkins, London, 1960)
3 CLARK, D.R.: 'The demise of multimedia', IEEE Comput. Graph. Appl., 1991, 11, (4), pp. 75-80
4 GREGORY, R.L.: 'The intelligent eye' (Weidenfeld and Nicolson, London, 1970)
5 FRISBY, J.P.: 'Seeing: illusion, brain and mind' (Oxford University Press, Oxford, 1979)
6 LAND, E.H.: 'The retinex theory of colour vision', Sci. Am., 1977, (12), pp. 108-128
7 Adapted from KELLY, D.H.: 'Spatiotemporal variation of chromatic and achromatic contrast thresholds', J. Opt. Soc. Am., 1983, 73, pp. 724-750
8 HURVICH, L.M.: 'Colour vision' (Sinauer Associates, Sunderland, MA, USA, 1981)
9 CLARK, D.R. (Ed.): 'Computers for imagemaking' (Pergamon, Oxford, 1981)
10 'Digital fact book', Quantel, Newbury, Berks.
11 FOLEY, J.D., and VAN DAM, A.: 'Fundamentals of interactive computer graphics' (Addison-Wesley, London, 1982)
12 ROEDERER, J.G.: 'Introduction to the physics and psychophysics of music' (Springer-Verlag, Heidelberg, 1975)
13 GOLDBERG, A., and ROBSON, D.: 'Smalltalk-80' (Addison-Wesley, London, 1983)
14 CLARK, D.R., and SANDFORD, N.: 'Semantic descriptors and maps of meaning for videodisc images', Program. Learn. & Educ. Technol., 1986, 23, (1), pp. 84-90
15 GONZALEZ, R.C., and WINTZ, P.: 'Digital image processing' (Addison-Wesley, London, 1977)

