Automatic Mixing and Tracking of On-Pitch Football Action for Television Broadcasts

R. G. Oldfield¹, B. G. Shirley¹

¹ Acoustics Research Centre, The University of Salford, Salford, UK [email protected], [email protected]

ABSTRACT

For the television broadcast of football in Europe, the sound engineer will typically have an arrangement of 12 shotgun microphones around the pitch to pick up on-pitch sounds such as whistle blows, players talking and ball kicks. During a match, the sound engineer will typically raise and lower the levels of these microphones manually according to where the action is on the pitch at a given time, to prevent the final mix being awash with crowd noise. As part of the EU-funded project FascinatE, we have developed an automatic mixing algorithm that intelligently seeks key events on the pitch and turns on the corresponding microphones. The algorithm picks out the key events and automatically tracks the action, eliminating the need for manual tracking.

1. INTRODUCTION

In the television broadcast of football, the main problem facing the sound engineer is the control of crowd noise. Whilst there are designated microphones for recording the crowd and ambience (typically stereo pairs and Soundfield® microphones), a standard setup (see Figure 1) also includes 12 shotgun microphones placed around the pitch for on-pitch sounds only. These microphones are chosen to be highly directional but they still pick up much of the crowd noise, either from the rear of the microphone or from the crowd on the opposite side of the pitch. If all of these microphones were left high in the mix, the result would be an audio mix awash with crowd noise; if they were left low, there would be no pitch sounds, which would be unrealistic. Consequently the sound engineer tracks the action on the pitch and only raises the level of a microphone when the action is nearby and there are likely to be on-pitch sounds to pick up.

This process can be rather laborious for the sound engineer and also means that the only sounds picked up are those around the main action, which may not tell the whole story of the match. If, for example, a player at one end of the pitch shouts for a team mate to pass the ball, the shout would not be picked up because the action has not yet reached that part of the pitch. Likewise, if there were an altercation between two players or another event away from the play, the corresponding sound would not be picked up. A further problem can be the sound engineer’s reactions: if the play switches quickly from one end of the pitch to the other, he/she has to move one set of faders down and the others up quickly, and if this is not done in time the audio will not be picked up. Only raising the faders when there is action nearby also results in a fluctuation of level during the match, but this is masked by the crowd noise from the designated ambient/crowd microphones and so is not perceptible to viewers. These microphones are usually placed high in the stadium so that they capture the ambience rather than individuals in the crowd, which makes the crowd noise more easily controlled by the sound engineer.

In this paper we present a method that not only allows multiple events from different areas of the pitch to be included in the mix at any one time but also will mix the audio automatically and will therefore alleviate the need for any manual mixing or tracking the action as is the current practice. The key element to the automatic mixing process is the separation of the unwanted crowd noise from the on-pitch sounds; this can be rather problematic as the pitch microphones are often swamped by crowd noise. However the nature of this crowd noise is that it contains few sharp transients of significant level; this means that if a significant transient is detected in any of the microphones, it is likely that it is picking up an auditory event that is not crowd noise and is therefore an on-pitch sound.

Once these sounds have been detected, the location of the sound source can also be approximately determined using a time delay estimation (TDE) technique [1]; if this location is on the pitch, the microphone can be made active in the mix. This technique, however, can only be applied if the sound source is detected in more than one microphone. Separating and positioning on-pitch sources in this way also means that they can be spatially mixed and output over stereo, ambisonics, 5.1 etc. to give a more immersive viewing experience. It also allows a better reconstruction of the audio scene, giving consistent source positions on the pitch even as the user navigates or interacts with the broadcast scene, which is one of the goals of the EU-funded FascinatE project [2], of which this research is a part.

FascinatE stands for Format-Agnostic SCript-based INterAcTive Experience. The objective of the project is to allow a completely customisable viewing experience in which viewers can choose which area of interest on the pitch they would like to view and are free to navigate around the visual scene by zooming, panning etc. The audio has to match the visual content, so it is important that a realistic and robust audio scene be recorded for quality spatial reproduction whatever the user’s navigational decision [3].

2. METHODOLOGY

In order to choose automatically when each microphone should be made active, it is possible to track the game, either with the faders on a desk or using an automatic tracking device [4] which then communicates with the mixing desk to perform an automatic mix. It is also possible to perform an automatic mix based on the content of the audio in each microphone: the signals are analyzed and, depending on the results, the microphone is made active or inactive in the mix. The downside of this approach is that the signal must be received before the decision to add the microphone to the mix can be made, which requires a slight delay in the broadcast to allow for the processing and detection time of the algorithm. In most cases this is not a problem, as even with live football the broadcast is several seconds behind the actual events.

2.1. Detecting the key audio events

The algorithm presented here analyses the signals from all of the pitch-side microphones, filtering the signals with two specially designed filters, one for each type of audio event to be detected (ball kicks or whistle blows from the referee). It then determines the amplitude envelope of the filtered output using the Hilbert transform. The key audio events can then be extracted when the gradient of the envelope exceeds a given threshold, i.e. when a significant transient has occurred, in a manner similar to a traditional noise gate. Once these sound sources have been isolated they can be positioned on the pitch using the TDE algorithm, or assigned an approximate zone if picked up by only one microphone.
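The filtering stage can be sketched as follows. This is only an illustration, not the authors' filter design: the paper gives the cut-off frequencies (a 280 Hz low-pass for ball kicks and a 3.6 kHz to 4.0 kHz band-pass for whistles, as stated later in this section) but not the filter order or type, so the second-order "biquad" sections, the Q values and the names `biquad_coeffs`, `biquad_filter`, `make_event_filters` and `rms` are all assumptions.

```python
import math


def biquad_coeffs(kind, fs, f0, q):
    """Return normalised (b, a) coefficients for a second-order filter.

    Uses the widely published audio-EQ-cookbook biquad formulas.
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    if kind == "lowpass":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    elif kind == "bandpass":  # unity gain at the centre frequency
        b = [alpha, 0.0, -alpha]
    else:
        raise ValueError("unknown filter kind: " + kind)
    a = [1.0 + alpha, -2.0 * cos_w0, 1.0 - alpha]
    return [c / a[0] for c in b], [c / a[0] for c in a]


def biquad_filter(b, a, x):
    """Filter the sample sequence x (direct form I)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def make_event_filters(fs):
    """Hypothetical helper: one filter per event type named in the text."""
    lo, hi = 3600.0, 4000.0
    centre = math.sqrt(lo * hi)  # geometric centre of the pass band
    return {
        # Assumed Butterworth-style Q for the 280 Hz low-pass.
        "ball_kick": biquad_coeffs("lowpass", fs, 280.0, 1.0 / math.sqrt(2.0)),
        # Assumed Q = centre / bandwidth, so the -3 dB points fall near 3.6 and 4.0 kHz.
        "whistle": biquad_coeffs("bandpass", fs, centre, centre / (hi - lo)),
    }


def rms(x):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in x) / len(x))
```

Each microphone signal would be passed through both filters, giving one low-pass and one band-pass version per microphone for the envelope stage. A steeper (higher-order) design may well be needed in practice to reject crowd noise close to the pass bands.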

Figure 2 Flow diagram of audio extraction algorithm: microphone input → filter the signal → extract envelope of filtered signal → calculate gradient of envelope → pick out points where gradient exceeds threshold

Each of the microphone signals is fed into the algorithm. The microphone signals are then filtered depending on the type of event to be extracted. Initially, just two filters were implemented: a low-pass filter with a cut-off frequency of 280 Hz is used to extract ball kicks, and a band-pass filter with a pass band between 3.6 kHz and 4.0 kHz, which corresponds approximately to the fundamental frequency of an average whistle blow, is used to extract whistle blows. Once the signal has been filtered to be sensitive to a particular event type, the envelope of the signal is extracted. The method used for the envelope extraction is based on the Hilbert transform of the signal [5]. For this method the envelope of the signal $y(t)$ is given by equation (1)

$$\mathrm{env}(t) = \sqrt{y^2(t) + \hat{y}^2(t)} \tag{1}$$

where $\hat{y}(t)$ is the Hilbert transform of the signal, given by (2).

$$\hat{y}(t) = \frac{1}{\pi}\,\mathrm{p.v.}\!\int_{-\infty}^{\infty} \frac{y(\tau)}{t - \tau}\,d\tau \tag{2}$$
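For sampled signals the Hilbert-transform envelope is usually computed in the frequency domain rather than from the integral directly. The sketch below is a minimal pure-Python illustration of that idea, not the authors' implementation: it forms the analytic signal (the signal plus j times its Hilbert transform) with a naive DFT and takes its magnitude, which is the envelope of equation (1). The function names are hypothetical, and the O(N²) transform is only suitable for short frames; a real system would use an FFT library.

```python
import cmath
import math


def analytic_signal(y):
    """Return y(t) + j*y_hat(t), built with a naive O(N^2) DFT."""
    n = len(y)
    spectrum = [
        sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(1, n):  # the DC bin (k = 0) is left unchanged
        if n % 2 == 0 and k == n // 2:
            continue                 # Nyquist bin is left unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0       # positive frequencies are doubled
        else:
            spectrum[k] = 0.0        # negative frequencies are removed
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def hilbert_envelope(y):
    """Envelope as the magnitude of the analytic signal: sqrt(y^2 + y_hat^2)."""
    return [abs(z) for z in analytic_signal(y)]
```

For a steady tone the envelope is flat at the tone's amplitude, so a sudden event such as a ball kick shows up as a sharp rise in the envelope rather than as individual waveform cycles.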

The gradient of this filtered-signal envelope is then taken, which shows how quickly the envelope is changing. Points are picked out where the gradient exceeds a given threshold; each corresponds to a transient of significant level in the signal, which should therefore be extracted. For a ball kick, the envelope of the low-pass filtered signal changes quickly, so the gradient of the envelope is high. The audio can then be extracted from the original microphone signal accordingly. In practice, each microphone is silenced until a significant transient is detected in either the low-pass or the band-pass filtered signal. When a transient is detected the microphone is switched on with an attack, sustain and release envelope applied, as shown in Figure 3. If any additional transients are detected during the sustain or release phase of the amplitude envelope, the sustain phase is extended to include them.
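The gradient-threshold step can be sketched as below. The paper does not give the threshold value or say how repeated threshold crossings within one event are handled, so `threshold` and the `refractory_s` hold-off are assumptions for illustration, as is the function name.

```python
def detect_transients(envelope, fs, threshold, refractory_s=0.05):
    """Return sample indices where the envelope gradient exceeds threshold.

    The gradient is a first difference scaled to amplitude units per second.
    Crossings closer together than refractory_s (an assumed hold-off) are
    treated as part of the same transient and reported once.
    """
    hold = int(refractory_s * fs)
    onsets = []
    last_crossing = None
    for n in range(1, len(envelope)):
        gradient = (envelope[n] - envelope[n - 1]) * fs
        if gradient > threshold:
            if last_crossing is None or n - last_crossing > hold:
                onsets.append(n)
            last_crossing = n
    return onsets
```

Raising `threshold` makes the detector less sensitive, which trades missed quiet events for fewer false detections of crowd sounds.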

Figure 3 Amplitude envelope to be applied to microphone signals when a transient is detected (attack, sustain and release phases around the transient event)
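A gain curve of this shape, including the rule that the sustain phase is extended by further transients, might be generated as in the following sketch. It assumes linear ramps and assumes that the attack ramp finishes at the transient itself, which is one reading of the look-ahead behaviour discussed in section 3.2; the function names and the use of sample counts for all durations are illustrative choices.

```python
def merge_events(onsets, sustain, release):
    """Group onsets into [start, hold_end] segments (all values in samples).

    A transient arriving during the sustain or release phase of the current
    segment extends that segment's sustain instead of starting a new one.
    """
    segments = []
    for n in sorted(onsets):
        if segments and n <= segments[-1][1] + release:
            segments[-1][1] = max(segments[-1][1], n + sustain)
        else:
            segments.append([n, n + sustain])
    return segments


def gate_gain(length, onsets, attack, sustain, release):
    """Per-sample gain in [0, 1]; attack and release must be positive."""
    gain = [0.0] * length
    for start, hold_end in merge_events(onsets, sustain, release):
        first = max(0, start - attack)
        last = min(length, hold_end + release + 1)
        for n in range(first, last):
            if n < start:
                g = 1.0 - (start - n) / attack      # linear fade-in before the transient
            elif n <= hold_end:
                g = 1.0                              # sustain
            else:
                g = 1.0 - (n - hold_end) / release   # linear fade-out
            gain[n] = max(gain[n], g)
    return gain
```

Multiplying each microphone signal by its own gain curve and summing the results gives the automatic mix of the on-pitch microphones.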

The use of this technique can also eliminate unwanted crowd noise; for example during a corner when the corner microphones are on, it is common to hear individuals in the crowd (whose speech may well include expletives). However if the microphones are only on when there is an on-pitch auditory event, these occurrences can be minimized and the crowd noise can be recorded with the designated microphones and controlled more easily by the sound engineer.

Figure 1 Typical microphone setup for an English Premier League match (12 shotgun microphones, numbered 1 to 12, around the pitch, with a stereo pair and a Soundfield® microphone)

2.1.1. Results

A basic test was carried out to analyze the effectiveness of the automatic audio event extraction algorithm by comparing its output with the actual mix that was broadcast by the BBC. The recordings came from an English Premier League match that was recorded in conjunction with an outside broadcaster (SIS Live). The match took place on 23rd October 2010 and featured Chelsea versus Wolverhampton Wanderers.

The microphone setup was as shown in Figure 1. There are tight regulations on broadcasts of Premier League football, so additional microphones could not be added. This poses one of the biggest challenges to extracting on-pitch sources, because there are not enough shotgun microphones to perform any accurate array processing or beamforming operations.

To perform the test, a random one-minute section of the match was chosen. This section was cut out of each of the 12 shotgun microphone signals and listened to individually, counting the number of ball kicks and whistle blows in each. The algorithm was then run over the same set of files and the number of ball kicks and whistle blows that it extracted was counted. The results can be seen in Tables 1 and 2. In each case, whether ball kicks or whistle blows, a qualitative distinction is drawn between ambiguous events, where the event is likely but uncertain because of its low level, and definite events, where the event type is clear. This was done because it is often difficult for a human listener to identify the event: some events occur on the other side of the pitch and are consequently only faintly picked up by the microphone in question.

          Manual Ball Kicks               Auto Ball Kicks
Mic     Definite  Ambiguous  Total     Definite  Ambiguous  Total
1           5         1        6           5         1        6
2           1         2        3           1         0        1
3           3         1        4           3         0        3
4           2         2        4           0         0        0
5           0         1        1           0         0        0
6           1         2        3           1         1        2
7           1         4        5           0         1        1
8           2         1        3           0         1        1
9           1         2        3           1         0        1
10          1         2        3           1         0        1
11          5         0        5           2         0        2
12          2         1        3           2         0        2
Total      24        19       43          16         4       20

Table 1 Manual and automatic detection of ball kicks

The number of events in the final broadcast mix and in the automatic mix was also counted, as shown in Table 2.

          Manual Whistle Blows            Auto Whistle Blows
Mic     Definite  Ambiguous  Total     Definite  Ambiguous  Total
1           5         0        5           0         0        0
2           4         1        5           0         0        0
3           5         0        5           2         0        2
4           3         2        5           0         0        0
5           0         2        2           0         0        0
6           4         1        5           5         0        5
7           4         0        4           0         0        0
8           2         3        5           0         0        0
9           0         4        4           0         0        0
10          4         1        5           0         0        0
11          5         0        5           2         0        2
12          5         0        5           0         0        0
Total      41        14       55           9         0        9

Table 2 Manual and automatic detection of whistle blows

Figure 3 Number of ball kicks counted manually and automatically (x-axis: microphone number, 1 to 12; y-axis: number of ball kicks, 0 to 6; series: definite manual, definite automatic, ambiguous manual and ambiguous automatic ball kicks)

                 Definite   Ambiguous
Ball Kicks         66.7%      21.1%
Whistle Blows      22.0%       0%

Table 3 Percentage of events correctly extracted by the automatic extraction algorithm

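Figure 4 Number of whistle blows counted manually and automatically (x-axis: microphone number, 1 to 12; y-axis: number of whistle blows, 0 to 6; series: definite manual, definite automatic, ambiguous manual and ambiguous automatic whistle blows)

                 Ball Kicks   Whistle Blows
Broadcast Mix        7              5
Automatic Mix        8              5

Table 2 Number of ball kicks and whistle blows counted in the broadcast mix and automix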

The results show that the automatic event extraction manages to extract 66.7% of definite ball kicks for this one-minute section. The percentages are lower for the ambiguous events because of the lower signal-to-noise ratio in the microphone signals. The whistle blows in particular are harder to extract because the background noise contains more high-frequency transients, so it is more difficult to discriminate between these and the whistle blows without misinterpreting crowd sounds as whistle blows. However, even with the sensitivity to transients set low enough to avoid errors, the comparison with the final mix shows that each whistle blow was picked up, even if it was not detected in all of the microphone signals.
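The percentages in Table 3 follow directly from the totals rows of Tables 1 and 2, as the short check below shows. The function name is hypothetical. Note that this ratio compares counts only; it does not by itself confirm that each automatically extracted event matches a manually identified one.

```python
def extraction_rate(auto_count, manual_count):
    """Automatically extracted events as a percentage of the manual count."""
    if manual_count == 0:
        return 0.0
    return 100.0 * auto_count / manual_count


# Totals taken from Tables 1 and 2: (automatic, manual).
totals = {
    ("ball kicks", "definite"): (16, 24),
    ("ball kicks", "ambiguous"): (4, 19),
    ("whistle blows", "definite"): (9, 41),
    ("whistle blows", "ambiguous"): (0, 14),
}

rates = {key: round(extraction_rate(a, m), 1) for key, (a, m) in totals.items()}
```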

The audio event extraction algorithm looks for two principal audio event types: whistle blows and ball kicks. It is also possible to extract players’ communications from the pitch microphones using a third filter, but this is very problematic because it is difficult to distinguish automatically between individuals in the crowd and players on the pitch. Currently this problem cannot be resolved easily; however, different microphone techniques might help to discern which sounds are from the crowd and which are from the pitch, as described in section 3.1.

2.1.2. Discussion

It can be seen from the results that the automatic algorithm does not pick up all of the ball kicks or whistle blows when analyzing the individual microphone feeds. However, when the mix-down from all the microphone feeds is compared with the actual mix that was broadcast, it can be seen that the automatic algorithm has picked out an additional ball kick which was not picked up by the sound engineer moving the faders. This could be due to the ball switching rapidly from one end of the pitch to the other, leaving the engineer too little time to react and raise the faders for the next area of action.

The coefficients of the algorithm were chosen to make it less sensitive to transient events, so that erroneous events (such as crowd members shouting or chanting) would not be detected. The algorithm can be made more sensitive to either ball kicks or whistle blows, but this increases the likelihood of erroneous events being interpreted as key events. Of course, any mistakes made in the automatic detection are likely to be masked by the crowd noise recorded from the designated ambient microphones. This is not too dissimilar to the current situation, where the sound engineer may turn a fader up because the play is in that particular section of the pitch, but if there is no significant on-pitch audio only crowd noise is picked up. So in the time domain, errors in event extraction may not be too problematic. However, when positioning the sources in the sound field for a spatial audio representation of the scene, as will be done in the FascinatE project, such errors could be very problematic, as they would lead to the sound of individuals in the crowd being placed on the pitch, which would obviously be incorrect.

2.2. Positioning the source on the field

Panning the on-pitch auditory events may not be desirable for a standard TV broadcast and is not current practice. If, for example, the active broadcast camera changes, the panning would have to change so that the audio event appears to come from the correct direction with respect to the camera viewpoint. This could be achieved with simple synchronisation in the OB truck, so that the position of the audio event/object in the stereo, 5.1 etc. field is controlled by whichever camera the producer makes active at that point in time. However, it would mean audio sources constantly moving with each cut, which may be disturbing or distracting and consequently undesirable for the viewer. In the FascinatE project the aim is for users to have complete control over their visual scene, i.e. they will be able to select the viewing position, which requires the audio to be updated based on the individual’s decisions. With this in mind, it will not be an audio mix that is broadcast but rather a selection of audio objects and their locations. The audio mix will consequently happen at the user end, allowing a format-agnostic solution in which almost any audio system setup can be accommodated, and providing a rendering that is unique to the individual user, based on their viewing and listening preferences.

Positioning the sources on the pitch in the sound field can be done by simply looking at which microphone has energy at that moment in time and positioning its signal as the audio object in the centre of the zone it covers as shown in Figure 1. This has the undesirable effect that there can be quite a large mismatch in visual source location and the corresponding audio object which may be noticed in the FascinatE rendering, especially if the user zooms into the pitch where the relative difference in position will be greater. This problem can be overcome to some extent if more than one microphone picks up the same audio object, as a more precise position of the source could then be inferred from the relative delays between, and strengths of, the two microphone signals. This is not a problem in non-FascinatE applications as the sources are not panned so it is permissible to have two microphones active at any one time. For FascinatE applications it is imperative that any audio object only comes from one position on the pitch.
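The simplest positioning rule described above, placing the object at the centre of the zone covered by the microphone that currently has energy, could look like the sketch below. The zone-centre coordinates are not given in the paper, so `zone_centres` is a hypothetical lookup table that would have to be measured for the actual microphone layout; choosing the single loudest microphone is also an assumption, made so that each object has only one position.

```python
def locate_by_energy(mic_frames, zone_centres):
    """Return (mic_id, (x, y)) for the microphone with the highest mean-square
    energy in the current frame, or None if there are no samples.

    mic_frames:   {mic_id: list of samples for the current time frame}
    zone_centres: {mic_id: (x, y)} hypothetical centre of each microphone's zone
    """
    energies = {
        mic: sum(s * s for s in frame) / len(frame)
        for mic, frame in mic_frames.items()
        if frame
    }
    if not energies:
        return None
    loudest = max(energies, key=energies.get)
    return loudest, zone_centres[loudest]
```

In practice the frames passed in would be the gated microphone signals, so that only microphones with a detected on-pitch event compete for the object position.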

It is also important that noise from the crowd does not get positioned on the pitch (this requires that the audio extraction works well, although much of this will be masked by the crowd noise from the Soundfield® microphone(s)). If more than one microphone picks up a signal within a certain allowed time window, it is possible to determine whether they are picking up the same audio event by looking at the coherence between the microphone signals; if it is within certain bounds, it can be assumed that the sound source is the same. The time difference between the two signals arriving at the microphones can then be found by calculating the cross-correlation between them. This time difference can be used to position the source roughly on the pitch, although it only gives the azimuth direction; in terms of depth, the object has to be positioned somewhere on a line running through the centre of the active zones. Problems could arise with this technique if the algorithm were to detect noises from the crowd such as drum beats or people whistling, as these could be incorrectly positioned on the pitch in the FascinatE rendering. One possibility is to position two microphones at each location, slightly offset with respect to each other, so that the depth of the sound source could be determined; it could also be determined whether the source was coming from the crowd, in which case it could be rejected as an audio object.
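The time-difference step can be illustrated with a brute-force cross-correlation search, as sketched below. This is a plain time-domain correlation rather than the generalized correlation method of [1], and the height of the normalised correlation peak is used here only as a simple stand-in for the coherence test mentioned above. The function names, the lag search range and the speed of sound of 343 m/s are assumptions.

```python
import math


def cross_correlation_delay(a, b, max_lag):
    """Return (lag, peak): lag is the number of samples by which b lags a,
    peak is the normalised cross-correlation at that lag (1.0 = identical shape).
    """
    norm = math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                acc += a[n] * b[m]
        score = acc / norm if norm else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


def path_difference_m(lag, fs, c=343.0):
    """Convert a lag in samples to a path-length difference in metres."""
    return lag / fs * c
```

A pair of microphone signals would be accepted as the same event only if the peak exceeds some chosen bound; the path difference then constrains the source to a direction relative to the two microphones, not to a single point.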

This describes the case when more than one microphone picks up the same audio source; however, this is very often not the case, and it is more usual that the audio event is not picked up by any of the microphones, or is at a low level and almost completely masked by the crowd noise. This is a difficult case and could result in, for example, a player kicking a ball with no corresponding audio. This is currently the case for standard television broadcasts and is often not a problem, as viewers do not expect such a high level of accuracy in the audio rendering. In FascinatE, however, it could be highly problematic, as the viewer will have higher expectations, especially when zoomed right into the pitch: zooming in makes the crowd quieter, so the pitch noises are less masked and differences are more noticeable. This is a difficult problem to solve without the use of more microphones or ones that have better rejection from the rear. More advanced processing techniques could be implemented in the algorithm to reduce the noise from the crowd further, as described in section 3.

3. FURTHER WORK

Whilst the algorithm can be shown to be useful and effective as an automatic mixing and tracking algorithm, there are several improvements that could make it both more efficient and less prone to error, and several problems that need to be overcome.

3.1. Transients from the crowd

Problems could occur if there were a transient event in the sound from the crowd (e.g. someone hitting a drum): the algorithm would detect it and then position it incorrectly on the pitch. One possibility is to use a secondary microphone at each position which is not broadcast but used only for the positioning of sound objects; using a TDE algorithm, this might make it possible to determine whether a source is in front of or behind the microphone (if behind, it is from the crowd and not the pitch). Both microphones would point in the same direction, with one slightly behind the other; the time delay and level difference between the two signals could then be used to work out whether the sound is in front of or behind the microphone. This would mean an extra microphone at each position and more intensive processing, but would not take up any more room and so wouldn’t contravene any rules imposed on the broadcaster.
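The front/behind decision for such a microphone pair could be sketched as follows. This is only an illustration of the idea, using the sign of the measured delay and ignoring the level difference that the text also proposes; the spacing, tolerance, labels and function name are assumptions, since the paper describes the arrangement only as a possibility.

```python
def classify_front_back(lag_rear_vs_front, fs, spacing_m, c=343.0, tolerance=1):
    """Classify a source from the delay between a front and a rear microphone.

    lag_rear_vs_front: samples by which the rear microphone's signal lags the
                       front one (positive means the sound reached the front first)
    spacing_m:         assumed front-to-rear spacing of the two microphones
    """
    max_lag = spacing_m / c * fs          # largest physically possible delay
    if abs(lag_rear_vs_front) > max_lag + tolerance:
        return "inconsistent"             # probably not the same sound event
    if lag_rear_vs_front > 0:
        return "pitch"                    # arrived from the front
    if lag_rear_vs_front < 0:
        return "crowd"                    # arrived from behind
    return "ambiguous"                    # side-on arrival, no usable delay
```

Sources classified as "crowd" would be rejected as audio objects, while "ambiguous" cases would need the level difference or another cue to resolve.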

3.2. Broadcast latency

Another problem is that the algorithm needs to ‘look ahead’ to know when a significant transient has occurred so that the microphone can be turned on or off (made active or inactive). This means that the audio transmission would be slightly late. This is possibly not a problem, as long as the necessary synchronisation can be done, because a standard broadcast is slightly behind real time anyway. The limit of this broadcast latency is the duration of the attack phase of the amplitude envelope, which in this case was 1 second. This time could be reduced, but if it is made too short the switching on of the microphone would become noticeable and possibly distracting.

3.3. Distant sources

It is difficult to pick up sound objects that are in the centre of the pitch or in between the sparsely placed shotgun microphones. Although such transient events can be discerned by the human ear, they are ambiguous and often not picked up by the automatic detection algorithm because of the low signal-to-noise ratio. One possibility, proposed for the FascinatE project, is to use auxiliary microphones further from the pitch, such as the Eigenmike® [6], to help determine the content and position of more audio objects.

Several techniques can be used to improve the accuracy of the audio event extraction and allow for better, more accurate positioning of sources in the sound field.

3.4. Using Camera Data for better source positioning

Using audio data alone for the localization and positioning of audio sources can work well, but in some circumstances the lack of audio data from more than one microphone may make it difficult to position the source on the pitch. With this in mind, it is possible to use data from the cameras to locate the sources on the pitch, and to use this either to turn on or point the microphones in the right direction for better capture, or to position the source more accurately in the sound field for rendering.

Figure 5 Using camera data to better position audio sources

The camera data that can be used includes camera zoom, pan/tilt position and focal length. More and more cameras coming on to the market nowadays use camera heads that are able to provide this kind of metadata with respect to time. For the main broadcast camera (whose job it is to follow the play around) this gives an approximate position of the main on-pitch action and therefore the main audio source(s). It would provide only a rough position, but when combined with data from the other cameras it is possible to perform some triangulation to get better source positions, as shown in Figure 5. This is particularly important for the FascinatE project if the rendering is to be done using wave field synthesis, where each sound source will need an accurate coordinate.
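The triangulation idea can be sketched in two dimensions by intersecting the pan directions of two cameras on the pitch plane. The camera positions, the coordinate system and the convention that pan angles are measured anticlockwise from the x-axis are all assumptions for illustration; real camera metadata would need calibrating to a pitch coordinate system first, and tilt and focal length are ignored here.

```python
import math


def triangulate(cam_a, pan_a_deg, cam_b, pan_b_deg):
    """Intersect two pan directions on the pitch plane.

    cam_a, cam_b: (x, y) camera positions in metres (assumed known)
    pan_*_deg:    pan angle of each camera, anticlockwise from the x-axis
    Returns the (x, y) intersection, or None if the directions are parallel.
    """
    ax, ay = cam_a
    bx, by = cam_b
    dax, day = math.cos(math.radians(pan_a_deg)), math.sin(math.radians(pan_a_deg))
    dbx, dby = math.cos(math.radians(pan_b_deg)), math.sin(math.radians(pan_b_deg))
    denom = dax * dby - day * dbx
    if abs(denom) < 1e-9:
        return None
    # Solve cam_a + t * d_a = cam_b + s * d_b for t using 2-D cross products.
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)
```

With more than two cameras, the pairwise intersections could be averaged to reduce the effect of errors in any single pan reading.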

3.4.1. Using player tracking for better source positioning

Another part of the FascinatE project is the implementation of a player tracking system. For FascinatE, this will allow users to keep their view on one particular player, or even to track the ball and follow it around the pitch automatically (Figure 6). If this technique is applied, it will provide the information needed to position the main source on the pitch. One can even imagine tracking the referee as well as the players, which would allow whistle blows to be positioned more accurately.

Figure 6 Using player tracking in FascinatE to position audio sources in sound field

3.4.2. Preprocessing the audio files

It might also be beneficial to pre-process the audio files before doing the audio extraction; this could be done by looking at a discrete set of parameters of interest, for example using Mel Frequency Cepstral Coefficients (MFCCs) or some music information extraction algorithm. This could be particularly useful when trying to extract players’ communications.

4. CONCLUSIONS

An algorithm has been developed that extracts the key audio events such as ball kicks and whistle blows from the 12 shotgun microphones placed around a football pitch in a standard broadcast. The algorithm allows the sources to be extracted and also positioned in space for a spatial audio rendering.

The algorithm has been developed within the context of the European-funded project FascinatE, but it is also useful for non-FascinatE applications. For FascinatE it will be used to determine the content and position of audio objects that can be rendered using almost any rendering technique, such as stereo, binaural, 5.1, ambisonics and wave field synthesis. For non-FascinatE applications it is a useful alternative to manually tracking the action around the pitch and adjusting the faders accordingly.

5. ACKNOWLEDGEMENTS

This research project work is part of the FascinatE project which has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no: 248138.

6. REFERENCES

[1] C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-24, 320–327 (1976).

[2] http://www.fascinate-project.eu

[3] JM. Batke, J. Spille et al., “Spatial Audio Processing for Interactive TV Services,” 130th Conv. Audio Eng. Soc., London, UK (2011).

[4] G. Cengarle, T. Mateos, N. Olaiz, and P. Arumí, “A New Technology for the Assisted Mixing of Sport Events: Application to Live Football Broadcasting.” Proc. 128th Conv. Audio Eng. Soc. London, UK (2010).

[5] H. Kuttruff, Room Acoustics: Spon press, 2000.

[6] http://www.mhacoustics.com
