
Detection and Visualisation of Radio Frequency Interference

A project for the course MAM4007W Mathematics of Computer Science

Supervised by: Michelle Kuttel, Sarah Blyth, and Anja Schroeder

Philippa Hillebrand HLLPHI012

Category                                                     Min  Max  Chosen Mark
Requirement Analysis and Design                                0   20   0
Theoretical Analysis                                           0   25  10
Experiment Design and Execution                                0   20  10
System Development and Implementation                          0   15  10
Results, Findings and Conclusion                              10   20  15
Aim Formulation and Background Work                           10   15  15
Quality of Report Writing and Presentation                    10   10  10
Adherence to Project Proposal and Quality of Deliverables     10   10  10
Overall General Project Evaluation                             0   10   0
Total                                                         80   80  80

Computer Science
University of Cape Town

South Africa
October 2014


Abstract

Radio Frequency Interference (RFI) comprises all the unwanted signals in the radio spectrum detected by a radio telescope, which interfere with the often much fainter astronomical signals. A clear separation of RFI and astronomical signals through detection is necessary for scientific observations. The majority of RFI signals are produced on Earth, although the sun is also a source. Earth-based signals cannot always simply be tracked down and switched off, as they are often major communications channels for systems like television and mobile telephones. Therefore a major requirement in radio astronomy is to detect and characterize, and then mitigate, these signals. This can be done manually, but it is much more efficient to do so computationally. Here we highlight and compare six detection/mitigation algorithms, aiming for their possible combination and implementation for the MeerKAT telescope. MeerKAT is located in a radio quiet area of the Karoo, the same site as for the international Square Kilometre Array (SKA) project. The SKA will be the world’s largest radio telescope, consisting of thousands of receivers, of which the MeerKAT is a precursor. We then describe the design and implementation of two RFI detection methods based on methods chosen from the literature.

Acknowledgements

Thank you to supervisors Michelle Kuttel, Sarah Blyth and Anja Schroeder for taking the time to read every draft chapter and discuss the design and testing of the system.

Thank you to the SKA for funding and supplying data for this project.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Research Question
  1.3 Approach

2 Background
  2.1 Radio Frequency Interference
  2.2 Characterization and detection of RFI
  2.3 Methods for RFI mitigation
  2.4 RFI detection algorithms
    2.4.1 Radio Astronomy Data
    2.4.2 Spectral Kurtosis
    2.4.3 SumThreshold
    2.4.4 AOFlagger
    2.4.5 Morphological Algorithm
    2.4.6 Spatial Filtering
  2.5 Characterization Methods
  2.6 Conclusions

3 Design
  3.1 Goals
  3.2 Approach
    3.2.1 SumThreshold Algorithm
    3.2.2 Final SumThreshold algorithm
    3.2.3 Surface fitting and dilation
    3.2.4 Variable window size
    3.2.5 System Architecture
    3.2.6 Software Development
  3.3 Input and Output
  3.4 Algorithm Analysis
    3.4.1 SumThreshold
    3.4.2 Variable Window
    3.4.3 Comparison

4 Implementation
  4.1 Languages and libraries
  4.2 SumThreshold Algorithm
    4.2.1 Prototype 1
    4.2.2 Optimisation 1
    4.2.3 Optimisation 2
  4.3 Surface and dilation algorithm (discontinued)
  4.4 Variable window algorithm
    4.4.1 Prototype 1
    4.4.2 Optimisation
  4.5 Conclusions

5 Validation
  5.1 Methods
    5.1.1 Determining success
  5.2 Tests
  5.3 Discussion

6 Results
  6.1 Case Study 1
  6.2 Case Study 2
  6.3 Case Study 3
  6.4 Profiling
  6.5 Discussion

7 Conclusions and Future Work

Appendices

A Validation Results

B SumThreshold

C Variable Window

D Supporting Code
  D.1 SaveDataAsImage
  D.2 transpose
  D.3 plotStuff
  D.4 makeSmooth
  D.5 noise

List of Figures

2.1 A signal from the LOFAR test station. Top left: Signal with no interference. Top right: Signal with interference. Bottom: RFI removed by spatial filtering using different filter types (see §2.4.6).[4]

2.2 Map of frequency restricted regions in the Karoo [7]

4.1 Diagram showing the structure of the detection and visualisation system.

5.1 a) Data in the general shape of real data, but with RFI removed, and noise added. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

5.2 a) Data in the general shape of real data, with a single RFI spike, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

5.3 a) Data with a baseline of zero, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

5.4 a) Data with a baseline of zero, a family of spikes, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

5.5 a) A zoomed view of the family of spikes. b) A zoomed view of the stripes displayed by the variable window mask.

5.6 a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

5.7 The SumThreshold mask searching for transient RFI

5.8 A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.

6.1 An ordinary data set with typical RFI in the frequency domain, and minimal RFI in the time domain, along with the masks produced by the algorithms designed.

6.2 A data set with typical RFI in the frequency domain, and two lines of RFI in the time domain, along with the masks produced by the algorithms designed.

6.3 An arbitrary data set which shows the necessity of the detection algorithms to see all the RFI within the data, along with the masks produced by the algorithms designed.

A.1 a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

A.2 a) Data with a baseline of zero, very low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

A.3 a) Data with a baseline of zero, low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

A.4 a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

A.5 a) Data with a baseline of zero, and one narrow spike. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The spike is accurately flagged by both methods.

Chapter 1

Introduction

The MeerKAT project in Carnarvon in the Karoo is a radio telescope which forms the precursor to the Square Kilometre Array (SKA) South Africa project. This telescope detects radio frequency signals from celestial bodies further away than any we have previously observed, and will consist of an array of telescope dishes larger than ever combined before. Unfortunately, radio signals are not produced only by celestial bodies, but also by man-made objects, and are used extensively for communication.

These man-made signals, which interfere extensively with the signals being observed from outer space, are known as Radio Frequency Interference (RFI) and can be observed in the data as amplitude spikes on a frequency spectrum. If these spikes are not noticed, the data is treated as trustworthy and astronomers may assume that the spikes are an interesting phenomenon, when actually they are just the neighbour starting his car.

For this reason, we apply signal processing techniques to the data, trying to find the signals which are statistically significantly different from the underlying noise. This underlying noise is the actual astronomical data and so it is particularly important that the noise is not marked as RFI. The output is some form of mask, which allows the astronomers to know which channels are corrupted, and which contain viable information.

The simplest form of RFI detection is known as thresholding. This means setting some value above which the data is flagged as RFI. In some circumstances this is done symmetrically, so if the data is lower than some value it is also flagged. There are more advanced forms of detection, which mostly build on the idea of thresholding.

1.1 Problem Statement

The aim of this project is to adapt and compare two methods of RFI detection which can then be used in characterisation of the RFI, and to determine the type of RFI which is being produced in the environment. The effectiveness of the algorithms will be evaluated according to how fast they are able to run, how sensitive they are to changes in the data, how much known RFI they are able to detect and how many false positives there are in the output mask.

1.2 Research Question

RFI and astronomical signals (radio waves produced by a source) both come in many different forms, which makes detection of RFI difficult. Also, the amount of data recorded by a radio telescope is very large, so any detection algorithm is required to be as efficient as possible. As such the following question will be investigated: Is it possible to adapt an existing detection algorithm to the supplied data, and add any form of characterization to that algorithm? As seen in the past work, Offringa et al.[15, 16, 17] have worked extensively on detecting RFI in array type telescopes. The data for this project, however, is collected, formatted, and stored differently. The challenge is therefore to apply existing methods to the new data. The characterization of a particular signal has not been researched in as great depth, and so designing an algorithm to appropriately characterize the signals may be beyond the time scale of this project.

1.3 Approach

The approach taken to solve this problem follows a simple route. We begin by looking into the solutions produced by others on similar problems, and examine the specifics of the SKA project and relate the solutions to the problem. We then choose two appropriate methods to implement for this project. These algorithms are described in detail in Chapter 2.

The next step is to design the algorithms to work with the data collected on the MeerKAT site. This process is shown in Chapter 3. In Chapter 4 we document the process of building up the system, and developing the chosen algorithms. This includes the development of a new algorithm which makes use of previous ideas, but implements them differently.

From there we move to the validation of the code in Chapter 5, and a discussion of the results of running the code on real data in Chapter 6.

Chapter 2

Background

The very first radio map of the skies was produced in 1942 by Reber, an amateur, who was intrigued by Jansky’s observations of the Milky Way in 1932[2]. Since then radio telescopes have developed to the point where there are two main types: large single dish telescopes (such as Arecibo[2]) and arrays of smaller dishes (such as the Low-Frequency Array (LOFAR), which has recently become fully operational[4, 16]). These telescopes make two different types of observations: active, utilizing RADAR¹ technology, and passive, picking up radio waves emitted by astronomical sources.

As radio telescopes become larger and more sensitive, more data on astronomical objects can be collected, leading to a much better understanding of the universe[2]. To this end, the Square Kilometre Array (SKA) telescope has been commissioned, which will be the largest radio telescope in the world. The SKA project, first discussed in 1993, has grown into a global project, located in South Africa and Australia[21]. The MeerKAT project in the Karoo is the precursor to the South African part and will become a part of the SKA. MeerKAT will consist of 64 antennae, with the maximum distance between the dishes being 8 km. The first dish was raised on the 27th of March 2014[24].

2.1 Radio Frequency Interference

Radio frequency interference (RFI) is electromagnetic interference (EMI) from signals in the radio frequencies of the electromagnetic spectrum (Figure 2.1). As EMI can be caused by any type of electrical circuit, sources of RFI are abundant. What is considered RFI is subjective, and dependent on the type of observation being made[5]. Because RFI signals (transmitted by a source) are mostly much stronger than the astronomical signals observed, they can overload the sensitive receivers, causing errors in the calibration of the signal. RFI can also occur in the same frequency as an astronomical signal, causing ambiguities and ‘ripples’ in the observed spectrum.

From the antenna, the radio signal is converted from analogue to digital, and then correlated with the signals from other antennae to create a complete picture of the observations. RFI can be created anywhere along this path. RFI can be categorized into two broad groups: narrow-band RFI (intentional transmissions, such as television signals, or FM radio signals) and broadband RFI (unintentional transmissions, such as those emitted by electric circuits, and power lines)[19]. It may be possible to find and shield broadband sources more easily than narrow-band.

¹ Radio Detection And Ranging, used originally for detecting aircraft.


Strong RFI signals (a signal is transmitted by a source) can completely drown out weaker signals of astronomical importance in the same channel (a channel is a set of frequencies grouped together to make data storage easier). This can cause a significant loss of data, as it can be necessary to completely ignore all signals found on the channel. The L-Band (around 1420 MHz) is important because this is where spectral lines denoting neutral hydrogen in a celestial body can be observed. Unfortunately, there are many RFI signals in this band[10], which make it difficult to differentiate valid signals and interference. The effects of the interference are shown clearly in Figure 2.1. The diversity of radio signals makes the detection of RFI challenging.

Figure 2.1: A signal from the LOFAR test station. Top left: Signal with no interference. Top right: Signal with interference. Bottom: RFI removed by spatial filtering using different filter types (see §2.4.6).[4]

2.2 Characterization and detection of RFI

Every RFI signal has unique characteristics which can be used to characterize the signal, such as strength, geographical location or position of the source, polarization, direction, orientation, periodicity over time, bandwidth, frequency distributions, modulation and encoding[18, 5]. Some characteristics, such as strength, are easy to identify for a single source, while others, such as polarization, are more difficult to determine. Characterizing a signal is useful, as it becomes much easier to locate the source and either shield it, have it switched off, or deal with the signal during the processing of the data collected. Knowing the polarization of the signal is useful, because astronomical signals are very weakly polarized, if at all, whereas RFI is usually strongly polarized.

Characterization also impacts on the detection algorithms, in that two signals can be compared if they have been characterized, and so it is possible to determine RFI signals through similarity with known RFI. It is also good to be aware of the radio atmosphere around the sensitive equipment, and to know when something changes, to make prediction of behaviour easier[18].

RFI detection and characterization algorithms aim to detect RFI, characterize and identify the signals for ease of management, flag the signals[19] and then mitigate the RFI in a manner that will lose the least possible astronomical data. This can be done by removing a point (frequency, time) which has been flagged[4].

2.3 Methods for RFI mitigation

One of the easiest ways to minimize RFI around a radio telescope is to declare the region to be radio quiet, which means that no transmitting or receiving radio devices are permitted within a certain distance of the telescope. This is difficult to enforce, as discovered at the Medicina telescope in Italy[3]: the growth of nearby cities cannot be curbed, and often the radio quiet region is encroached upon. For this reason, the MeerKAT and SKA projects are based in the Karoo, far from any large settlements. The Astronomical Advantage Act[7] enforces restrictions on frequencies by region (shown in Figure 2.2). These regions surround the core of the MeerKAT and SKA projects. Unfortunately, it is not possible to find a region with absolute radio quiet, independent of the regulations set in place. Satellites and aeroplanes still pass overhead and some signals are very long distance, such as television signals. So, beyond radio quiet regions, the International Telecommunications Union (ITU) has released a table specifying frequency allocations for different types of communication. This table is then specialized by the communications authority of each country to be applicable. The Independent Communications Authority of South Africa (ICASA) has released the relevant table for South Africa[8]. This allocates a relatively small number of narrow frequency bands to radio astronomy, and, commonly, these bands are shared with other communications areas. It is illegal for a signal to be transmitted outside of the allocated frequency, so signals detected in these areas can be turned off by the authorities (ICASA). If radio astronomy wishes to make use of a wide bandwidth of frequencies, there will be a large amount of RFI present which is entirely legal[5].

If it is impossible to avoid RFI, detection and mitigation schemes need to be developed. The many different types of RFI lead to many different detection algorithms. Many of these are designed for specific instruments or projects, and so are not directly suitable for all astronomical data. These algorithms can be compared, combined, and modified to provide a situation-specific solution.

2.4 RFI detection algorithms

The simplest form of detection is thresholding, which tests the strength of a signal against a predefined threshold value and, if the signal is above that value, flags it as RFI. This can be done with any kind of data, but is often done after the Fast Fourier Transform (FFT) part of the correlation process. The algorithms we consider here all work post-correlation, meaning they can work on saved data. A reference antenna (as is used at the MeerKAT site) is used to compare signals to aid in the detection.

Figure 2.2: Map of frequency restricted regions in the Karoo [7]
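The basic thresholding test described at the start of this section can be sketched in a few lines. This is only an illustration: the function name, the sample values and the threshold of 5.0 are assumptions, and the symmetric variant assumes the data has a baseline of zero.

```python
def threshold_flags(samples, threshold, symmetric=False):
    """Return one boolean per sample; True marks a sample flagged as RFI.

    With symmetric=True, samples far below zero are flagged as well
    (assumes a zero baseline, which is an illustrative simplification).
    """
    if symmetric:
        return [abs(s) > threshold for s in samples]
    return [s > threshold for s in samples]


# Hypothetical one-dimensional spectrum with one high and one low outlier.
spectrum = [1.0, 1.2, 0.9, 8.5, 1.1, -7.0, 1.0]
print(threshold_flags(spectrum, threshold=5.0))
print(threshold_flags(spectrum, threshold=5.0, symmetric=True))
```

The returned list plays the role of the mask: it has the same length as the data and marks which samples should be treated as corrupted.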

2.4.1 Radio Astronomy Data

Radio astronomy data is collected for many different purposes and in many different ways, with different emphases. The data could be collected on a satellite (the SMOS project[18]) or by Earth-based radio telescopes. These telescopes vary in design, data collected, and collection method. They can be single- or multi-dish (or beam) and they can observe actively or passively. Telescopes with either a multi-beam feed, or an array of dishes, have their data correlated and so calculate a covariance matrix, as used in the spatial filtering technique. The SKA project will be made up of a large array of passively observing dishes[21], so we may use these techniques. Currently on the MeerKAT site, there is a single antenna observing the environment for RFI. This antenna is used to detect and characterize, as well as visualize, RFI before the full telescope becomes operational, which will make RFI mitigation easier later. Therefore at this stage techniques which require multiple antennae will not be usable.

2.4.2 Spectral Kurtosis

Spectral Kurtosis (SK) is a statistical method used for RFI detection, which is usually applied to time-averaged, non-Gaussian data, but can be extended to other data types[1]. SK is a thresholding method, applied either during or after the FFT[11], and is applied equally well in frequency and time domains. The spectral kurtosis can be calculated using

\[ V_k^2 = \frac{\sigma_k^2}{\mu_k^2}, \tag{2.1} \]

where $\sigma_k^2$ is the variance and $\mu_k$ is the mean of the power spectral density (PSD). A sample with no RFI will have $V_k^2 = 1$. The mean and the variance are computed from $M$ spectral estimates $P_{ki}$, where $k$ is the channel number and $i = 1, \ldots, M$. These are used to calculate the instantaneous power spectral density (PSD) $S_1$ and the squared spectral power $S_2$,

\[ S_1 = \sum_{i=1}^{M} P_{ki} \tag{2.2} \]

\[ S_2 = \sum_{i=1}^{M} P_{ki}^2. \tag{2.3} \]

Then the mean and variance are given by

\[ \mu_k = \frac{S_1}{M} \tag{2.4} \]

and

\[ \sigma_k^2 = \frac{M S_2 - S_1^2}{M(M - 1)}. \tag{2.5} \]

This gives

\[ V_k^2 = \frac{M}{M - 1} \left( \frac{M S_2}{S_1^2} - 1 \right). \tag{2.6} \]

The variance of $V_k^2$ is then calculated, and compared to the expected value of

\[ \mathrm{var}(V_k^2) = \begin{cases} 24/M, & k = 0, N \\ 4/M, & k = 1, \ldots, (N - 1) \end{cases} \quad [12], \tag{2.7} \]

where $N$ is the Nyquist rate associated with the sampling rate. The Nyquist rate is the minimum rate at which a signal can be sampled without introducing errors; it is twice the highest frequency in the signal[23]. If the variance is significantly different from a baseline level such as the median, the signal can be considered to be RFI.

A good implementation of the SK method requires a full understanding of all the statistical techniques involved. The complexity of the algorithm depends on how many windows of size $M$ are used, giving a worst case $O(N^2)$ complexity. The SK method is suitable for use on any type of data, but, as a purely statistical method, does not hold much interest from a computing perspective.
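As a minimal sketch of how the estimator in Equation 2.6 could be evaluated for a single channel, the following computes $S_1$ and $S_2$ from a list of power estimates. The function names are illustrative, and the decision rule in `looks_like_rfi` (a tolerance of a few standard deviations, using the $4/M$ variance from Equation 2.7) is an assumption made for the example rather than a rule taken from the literature cited above.

```python
import math


def spectral_kurtosis(power_estimates):
    """Estimate V_k^2 for one channel from M power estimates P_ki."""
    m = len(power_estimates)
    if m < 2:
        raise ValueError("need at least two spectral estimates")
    s1 = sum(power_estimates)                  # Eq. 2.2
    s2 = sum(p * p for p in power_estimates)   # Eq. 2.3
    if s1 == 0:
        raise ValueError("total power is zero, V_k^2 is undefined")
    return (m / (m - 1)) * (m * s2 / (s1 * s1) - 1.0)  # Eq. 2.6


def looks_like_rfi(v2, m, n_sigma=3.0):
    """Assumed decision rule: V_k^2 is further from 1 than n_sigma
    standard deviations, with var(V_k^2) = 4/M as in Eq. 2.7."""
    return abs(v2 - 1.0) > n_sigma * math.sqrt(4.0 / m)


# Hypothetical power estimates for one channel.
estimates = [1.0, 2.0, 3.0, 4.0]
v2 = spectral_kurtosis(estimates)
print(v2, looks_like_rfi(v2, len(estimates)))
```

Because Equation 2.6 is just the sample variance divided by the squared mean, the result can be cross-checked against any standard variance routine.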

2.4.3 SumThreshold

The SumThreshold method is a form of combinatorial thresholding, which means that samples are not only checked individually for high values, but are also combined to check if two or more neighboring samples are all above a slightly lower threshold value. The flagging function for frequency and time can then be given as

\[ \mathrm{flag}_{\nu}^{M} \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} : |R(\nu + (i - j)\Delta\nu, t)| > \chi_M \tag{2.8} \]

\[ \mathrm{flag}_{t}^{M} \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} : |R(\nu, t + (i - j)\Delta t)| > \chi_M, \tag{2.9} \]

where $M$ is the number of samples in a combination, $\chi_M$ is the threshold value, and $|R(\nu, t)|$ is the value of the sample at time $t$ and frequency $\nu$. A sample can be flagged in either time or frequency. Once a sample has been flagged, its value is changed for future combinations to be the average threshold size ($\chi_M$). This lowers the frequency of false positives in the flagged data[14]. The difficulty in this approach lies in calculating appropriate $\chi$ values, although it may be possible to make use of Spectral Kurtosis to do this. Much as for SK, the complexity depends on both the number of samples and the iterations through combinations up to size $M$, giving a worst case $O(N^2)$ time. The method can be used on any type of data, making it suitable for this project, and the main interest would be in comparison with SK.
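The combinatorial test can be illustrated along one axis (frequency or time) as follows. This sketch follows Equation 2.8 as written: a sample is flagged if it lies in a run of $M$ consecutive samples that all exceed $\chi_M$. The function names, the sample values and the thresholds are assumptions for the example, and the value-replacement step described above is left out to keep it short; it is not the implementation developed in this project.

```python
def flag_window(values, m, chi_m):
    """Flag every sample lying in a run of m consecutive samples
    whose absolute values all exceed chi_m (Eq. 2.8 along one axis)."""
    flags = [False] * len(values)
    for start in range(len(values) - m + 1):
        window = range(start, start + m)
        if all(abs(values[i]) > chi_m for i in window):
            for i in window:
                flags[i] = True
    return flags


def combinatorial_threshold(values, thresholds):
    """Combine the flags for several window sizes.

    thresholds maps a window size M to its threshold chi_M; larger
    windows are normally paired with lower thresholds.
    """
    combined = [False] * len(values)
    for m in sorted(thresholds):
        flags = flag_window(values, m, thresholds[m])
        combined = [a or b for a, b in zip(combined, flags)]
    return combined


# Hypothetical data: one strong spike and two weaker neighbouring samples.
data = [0.1, 6.0, 0.2, 3.5, 3.2, 0.1, 0.3]
print(combinatorial_threshold(data, {1: 5.0, 2: 3.0}))
```

In this example the spike at index 1 is caught by the single-sample threshold, while the pair at indices 3 and 4 is only caught when two neighbours are considered together against the lower threshold.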

2.4.4 AOFlagger

The AOFlagger is an algorithm which was implemented at LOFAR in 2010[16]. As input, it takes information on a single polarization or set of Stokes I data (an integration technique used to join all the data into one spectrum). The amplitudes are calculated, and a thresholding technique is used to generate the first flags. The channels (frequency) or time steps (time) are then compared based on root-mean-square (rms) values, to flag the outliers. The data are then fitted to a 2D Gaussian surface, again to smooth out outliers. The process is then iterated, increasing the strictness of the threshold until the data converges on the surface. A dilation is then performed on the data, flagging further RFI around the edges of the channels or time steps, on the supposition that not all the RFI was found. At this point the flags can be compared with the original data[13]. The most difficult part of the AOFlagger is the dilation step, ensuring that the flags are not spread too far, thus flagging channels or time steps unnecessarily. The complexity is certainly above linear time, as the data are fitted to a 2D surface, which requires at least $O(N \log N)$. The algorithm is certainly suitable for the data produced, and is of interest when considered in conjunction with basic thresholding techniques, as well as more advanced techniques.
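The dilation step can be pictured with a simple one-dimensional binary dilation, sketched below. This only illustrates the idea of spreading flags to neighbouring samples; the function name and the `radius` parameter are assumptions, and it is not the dilation used by the AOFlagger itself.

```python
def dilate_flags(flags, radius=1):
    """Spread each flag to its neighbours within `radius` samples.

    A larger radius catches more weak RFI at the edges of a flagged
    region, at the cost of discarding more good data.
    """
    n = len(flags)
    out = list(flags)
    for i, flagged in enumerate(flags):
        if flagged:
            for j in range(max(0, i - radius), min(n, i + radius + 1)):
                out[j] = True
    return out


# Hypothetical mask along one channel with a single flagged time step.
mask = [False, False, True, False, False, False]
print(dilate_flags(mask, radius=1))
```

The difficulty noted above shows up directly in the choice of `radius`: too small and weak edges are missed, too large and whole channels or time steps are flagged unnecessarily.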

2.4.5 Morphological Algorithm

This algorithm was designed for the LOFAR telescope, and so is suitable for extension to the SKA, as the two telescopes are similar. It combines a number of techniques, such as thresholding and the use of reference antennae, which give good estimates of frequencies in which there is RFI. The algorithm utilizes the fact that most RFI signals are parallel either to the time or the frequency axis. It builds particularly on the AOFlagger algorithm (§2.4.4). The key concepts used are morphology, and the idea of a scale invariant rank (SIR) operator. An SIR operator is a mathematical operator $\rho$ for which $\rho(\lambda x) = \lambda \rho(x)$, where $\lambda$ is a constant. The operator must be of the SIR type, because RFI signals are themselves scale invariant, meaning that they are not affected by scaling the data. The SIR operator is applied after a basic flagging method and is applied separately to time and frequency. The operator can be defined as

\[ \rho(X) = \bigcup \left\{ [Y_1, Y_2) : |X \cap [Y_1, Y_2)| \geq (1 - \eta)(Y_2 - Y_1) \right\}, \tag{2.10} \]

where $[Y_1, Y_2)$ is a half open interval in either frequency or time and $\eta$ gives the aggressiveness of the operator (meaning that $\eta = 0$ will flag nothing, and $\eta = 1$ will flag everything). To recombine the time and frequency channels, either a union of the two can be taken, or the operator can be applied sequentially in each channel. The sequential combination is more aggressive than the union and the order of the sequence will influence what is flagged[17]. A full proof of the scale invariance of the operator $\rho$, as well as a full implementation of the algorithm in $O(N)$ time, was given by Offringa et al.[17]. The algorithm is predominantly of theoretical interest and uses interesting mathematical concepts.
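Equation 2.10 can be evaluated directly by testing every half open interval, as in the sketch below. This is a naive illustration that checks a quadratic number of intervals, not the linear time algorithm of [17]; the function name and example mask are assumptions.

```python
def sir_operator(flags, eta):
    """Naive scale invariant rank operator along one axis (Eq. 2.10).

    An interval [y1, y2) is flagged in full when the number of already
    flagged samples inside it is at least (1 - eta) * (y2 - y1).
    """
    n = len(flags)
    # prefix[i] holds the number of flagged samples in flags[:i].
    prefix = [0]
    for flagged in flags:
        prefix.append(prefix[-1] + (1 if flagged else 0))
    out = [False] * n
    for y1 in range(n):
        for y2 in range(y1 + 1, n + 1):
            count = prefix[y2] - prefix[y1]
            if count >= (1.0 - eta) * (y2 - y1):
                for i in range(y1, y2):
                    out[i] = True
    return out


# Hypothetical mask from a basic flagging step along one axis.
mask = [True, False, True, False, False, False, False, False]
print(sir_operator(mask, eta=0.5))
```

With $\eta = 0$ only intervals that are already fully flagged qualify, so the mask is returned unchanged, and with $\eta = 1$ every interval qualifies, so everything is flagged, matching the two limits described above.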

2.4.6 Spatial Filtering

Spatial filtering aims to reduce the RFI levels in a sample to the point where they can be “seen through” to view the astronomical signals. Thus it is a mitigation method, although it can be used for combined detection and mitigation. The spatial filtering technique is based on the manipulation of the covariance matrix $C$ formed by correlation of the data from multiple channels (dishes or beams). The background astronomical signals and system noise are considered to be Gaussian noise[4]. The eigenvector and eigenvalue matrices are found, giving $C = U \Lambda U^H$, where $\Lambda$ is a diagonal matrix containing the eigenvalues in descending order, $U$ is the matrix of eigenvectors and $U^H$ is its Hermitian conjugate. The Hermitian conjugate is found by taking the transpose of the matrix and replacing each value with its complex conjugate. Either it is assumed that the RFI has the strongest signal in the system and the first value in $\Lambda$ is given a null value, or a filter is applied. The filter can be either a projection filter, which gives a projection of $C$ onto the noise subspace (replacing $C$ with $P_N C P_N$, where $P_N$ is the projection), or a subtraction filter, where the projection onto the interference subspace is subtracted from the system (replacing $C$ with $C - P_I C P_I$)[9].
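A short worked case may help to show what the two filters do. Assume, purely for illustration, a single interferer with spatial signature vector $a$ and power $\sigma_r^2$, on top of white noise of power $\sigma_n^2$; the symbols $a$, $\sigma_r$ and $\sigma_n$ are introduced here and do not appear in the cited work as quoted above.

```latex
\begin{align*}
C &= \sigma_n^2 I + \sigma_r^2\, a a^H, \\
P_I &= \frac{a a^H}{a^H a}, \qquad P_N = I - P_I, \\
P_N C P_N &= \sigma_n^2 P_N
  && \text{(projection filter, using } P_N a = 0\text{)}, \\
C - P_I C P_I &= \sigma_n^2 (I - P_I) = \sigma_n^2 P_N
  && \text{(subtraction filter, using } P_I a = a\text{)}.
\end{align*}
```

Under these idealised assumptions both filters remove the interferer completely and leave only noise in the subspace orthogonal to $a$. With real data the interference subspace has to be estimated, for example from the dominant eigenvectors in $U$, so the cancellation is only approximate.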

2.5 Characterization Methods

RFI characterization methods draw heavily on the detection methods, as a signal cannot be characterized before it has been detected, and many of the principles in detection and characterization are the same. Some characteristics are easy to find. The SMOS project[18], which measures the brightness temperature (BT) of Earth, found the power of the RFI signal to be directly proportional to the BT. They also suggested that the direction of a pulsating source can be found by analyzing the pulses. The SMOS is satellite-based, so not directly applicable to the SKA, but many of the principles remain the same. Another group working with synthetic aperture radar[10] matched the frequency and time stability of a signal to a known signal, from a specific type of radar tower. They also correlated geographical position. Unfortunately, the majority of their characterization is done as part of the detection of the signals.

2.6 Conclusions

In Table 2.1, the six algorithms in Section 2.4 are compared based on a number of factors. In this table it can be seen that some algorithms are more suitable to the data collected at MeerKAT than others, and some are more complete (or higher level) than others. The morphological algorithm (§2.4.5) is an example of a high-level algorithm suitable for the data. This algorithm does have room for extension, however, as the sub-algorithm of SumThreshold (§2.4.3) could be replaced with another, and characterization methods could be added to it. The spatial filtering algorithm (§2.4.6) is even higher level, going so far as to mitigate the RFI. This could quite easily be extended, by only applying the algorithm to samples already flagged, but it is not suitable for this data, as it requires an array of inputs. The main methods of interest are the flagging methods and the characterization methods. It would be interesting to combine these methods to flag data not just as RFI, but as a specific type of RFI, which could then be visualized, so that the radio environment of the MeerKAT area can be more intuitively understood.

Table 2.1: Comparison of algorithms discussed in 2.4, with scores given from 1-3 for each section, where a higher score means a higher ‘value’ in that section.

Algorithm                Features  Difficulty  Complexity  Suitability  Interest
Spectral Kurtosis        1         2           3           3            1
Morphological Algorithm  2         2           1           3            2
AO Flagger               2         1           2           3            3
SumThreshold             1         2           3           3            3
Spatial Filtering        2         3           3           2            1

Chapter 3

Design

3.1 Goals

In this project, we aim to determine an effective method for detecting and possibly characterising Radio Frequency Interference (RFI) in radio signals, particularly focussing on signals received from radio telescopes. As the data files are large (3600×14200 values per file), the method should be efficient in terms of both time and space.

3.2 Approach

We select two algorithms to be implemented and compared in discussion with the two Astronomy supervisors. We choose algorithms based on Table 3.1, with a focus on high suitability and low difficulty.

Table 3.1: Comparison of algorithms discussed in Chapter 2, with scores given from 1-3 for each area, where a higher score means a higher ‘value’ in that area.

Algorithm                Features  Difficulty  Complexity  Suitability  Interest
Spectral Kurtosis        1         2           3           3            1
Morphological Algorithm  2         2           1           3            2
AO Flagger               2         1           2           3            3
SumThreshold             1         2           3           3            3
Spatial Filtering        2         3           3           2            1

The two algorithms chosen are the SumThreshold method and the AOFlagger method. From these chosen algorithms the final methods are developed.

3.2.1 SumThreshold Algorithm

The SumThreshold method is a combinatorial thresholding method which, rather than simply checking if a value is above a specific threshold, includes the surrounding values in the computation. The flagging part can be given in equation form as

\[ \mathrm{flag}^{\nu}_{M} \text{ if } \exists\, i \in \{0 \dots M-1\} : \forall\, j \in \{0 \dots M-1\} \; |R(\nu + (i-j)\Delta\nu, t)| > \chi_M \]

\[ \mathrm{flag}^{t}_{M} \text{ if } \exists\, i \in \{0 \dots M-1\} : \forall\, j \in \{0 \dots M-1\} \; |R(\nu, t + (i-j)\Delta t)| > \chi_M . \]

This can be put into pseudo code as follows:


Set M, sum, threshold, maxM
For each window of size M_i (from M to maxM stepping 2, 4, 8, ...)
    set count = number of unflagged values in the window
    set sum = sum of all these values
    if (sum > count * threshold) OR (sum < -count * threshold)
        set a flag on unflagged values
        set values to be an average
    move the window to the right
    set the threshold for the new window position

3.2.2 Final SumThreshold algorithm

After a few optimisations during the implementation phase (Chapter 4) a final algorithm is left which is slightly changed from the original. The pseudo code is as follows:

Set M, sum, threshold, maxM
For each window of size M_i (from M to maxM stepping 2, 4, 8, ...)
    set sum = sum over j in window (value at j) - chi
    if (sum > 0)
        set a flag on unflagged values
        set values to be an average
    move the window to the right
    set the threshold for the new window position
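As a rough illustration of the final pseudo code, the following Python sketch flags a single row. It is not the listing from the appendices. It assumes that one threshold `chis[p]` is supplied per pass, that the window doubles on each pass, and that the step equals the window length (as in Optimisation 2 of Chapter 4). The replacement of flagged values by an average is left out, and the function and parameter names are hypothetical.

```python
import numpy as np

def sum_threshold_row(row, chis, first_window=2):
    """Flag one row of data; returns a boolean mask (True = RFI).

    chis holds one threshold per pass; the window doubles on every pass.
    """
    row = np.asarray(row, dtype=float)
    mask = np.zeros(row.size, dtype=bool)
    window = first_window
    for chi in chis:
        # The step equals the window length, so windows tile the row.
        for start in range(0, row.size, window):
            stop = min(start + window, row.size)
            unflagged = ~mask[start:stop]
            # Sum of signed distances from chi, over unflagged points only.
            total = np.sum(row[start:stop][unflagged] - chi)
            if total > 0:
                mask[start:stop] = True
        window *= 2
    return mask

# Made-up example: a quiet row at -80 with one spike at index 5.
row = np.full(16, -80.0)
row[5] = -20.0
mask = sum_threshold_row(row, chis=[-75.0, -77.0])
print(np.flatnonzero(mask))
```

Because the whole window is flagged, the neighbour of the spike is flagged too, which matches the wider flagged band discussed in Chapter 5.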

3.2.3 Surface fitting and dilation

The AOFlagger method is an extension of a thresholding method which adds surface fitting and dilation to the algorithm. The initial algorithm attempted was:

Row-wise repeat:
    Do (at least twice):
        - Replace flagged data with median value
        - Create spline interpolated surfaces
        - Compare values between interpolations,
          flagging those beyond a certain level.
    end do
end repeat

However, after beginning the implementation of the system, this algorithm was discarded, and a new one developed: the variable window method.

3.2.4 Variable window size

The variable window algorithm was developed in discussion with supervisors Sarah and Anja, and is an attempt to find an efficient way of checking all the data. This method makes use of a smoothed surface which underlies the data at every time period. This surface is used as a comparison, or base threshold value, and then a two-dimensional window is placed over the data. The size of this window depends on the rate of change of the standard deviation of the data. So, when the standard deviation is changing quickly, we assume that there are larger spikes in the data, and so use a smaller window. If the standard deviation changes slowly, we assume that there are fewer large spikes, and so use a larger window. The algorithm in pseudo code is:

repeat process a number of times
    Set window size and position
    loop through entire surface
        find standard deviation (s.d)
        find change in s.d (average over three)
        flag window (look for points 5 * s.d out)
        vary window size
        shift window on
    end loop
end repeat

3.2.5 System Architecture

The system is originally described by the following diagram, where the greyed out parts deal with visualisation, to be implemented by Gerard Nothnagel, and so are not discussed in this work:

The modifications to the algorithms lead to a modification of the system architecture, and so the final system is described by the following diagram:


The RATTY data is data collected on the MeerKAT site, and is provided for use by Christopher Schollar. The smoothed surface included in the grey oval is the underlying surface which the data is compared to in the variable window algorithm. It is required as an input to the system.

3.2.6 Software Development

We follow an iterative approach to the development of the software, focussing first on the requirements, then producing a detailed design, then implementing the design before testing and validation. This cycle is then repeated until a satisfactory result is achieved. We follow this process because the original algorithms have already been documented and so the initial design phase consists predominantly of adapting the algorithm to the situation. This means that the design should be finished before implementation begins, which is based on the waterfall process. The implementation is managed using version control through Git. This allows for more flexible implementation and experimentation. Sections are tested as they are developed, drawing from the concept of unit testing to ensure code integrity.

3.3 Input and Output

Input is in the format of HDF5 files (Chapter 2, Section 2.4.4) containing data collected on the MeerKAT site, which have a row for every time at which data was collected and a column for every frequency channel. The output is a new HDF5 file which contains a mask for the original file. This means that, if a value is flagged with 0 it has no RFI, and if it is flagged with 1 there is RFI of some type.


3.4 Algorithm Analysis

3.4.1 SumThreshold

Input: array height m, width n
1. load into memory
2. Create matching mask
3. loop m times
4.     while run < r
5.         while pos + l/2 <= n (step size = l/2)
6.             set chi
7.             flag window of length l
8. save and close files

This gives a very basic description of the algorithm which can be used to find the complexity. Lines 1, 2 and 8 add a term of $O(3 \cdot n \cdot m)$ to the complexity. The loop beginning in line 3 adds a factor of $m$. The loop in line 4 adds a constant factor $r$, bringing the complexity up to $O(r \cdot m + 3 \cdot n \cdot m)$. The choosing of the threshold value can be viewed as a non-trivial constant time calculation which takes time $c$. Flagging the window depends on its length $l$, and takes $O(2l)$ when the window must be flagged. The loop in line 5 contributes a factor of $\frac{2n}{l}$. So overall the complexity of the algorithm is:

\begin{align*}
\text{Complexity} &= O\left(r \cdot m \cdot \frac{2n}{l} \cdot (2l + c) + 3 \cdot n \cdot m\right) \\
&= O(r \cdot m \cdot 2n \cdot (2 + c) + 3 \cdot n \cdot m) \\
&= O((4r + 2rc + 3) \cdot m \cdot n) \\
&= O(k \cdot m \cdot n)
\end{align*}

Here $k$ is some fairly large constant, and the second line uses the bound $c/l \le c$, which holds because $l \ge 1$. It is this factor $k$ which must be optimised to improve the performance of the algorithm.

3.4.2 Variable Window

Input: array height m, width n
1. load two files into memory
2. Create matching mask
3. loop k times
4.     while time position + 1/2 time dimension <= m
           (step 1/2 time dimension)
5.         while frequency position + 1/2 frequency dimension <= n
               (step 1/2 frequency dimension)
6.             calculate sigma
7.             flag window
8.             change window size if appropriate (time never changes)
9. save and close files

As in §3.4.1, lines 1, 2, and 9 give a single term for the complexity, of $O(4 \cdot m \cdot n)$. Line 3 gives a factor of $k$. Line 4 gives a factor of $\frac{2m}{c_1}$, where $c_1$ is the smallest value for the time dimension. Line 5 gives a factor of $\frac{2n}{c_2}$, where $c_2$ is the smallest value for the frequency dimension. Lines 6-8 can be calculated in some constant time, say $c_3$. This gives the overall complexity as:

\begin{align*}
\text{Complexity} &= O\left(4 \cdot m \cdot n + k \cdot \frac{2m}{c_1} \cdot \frac{2n}{c_2} \cdot c_3\right) \\
&= O\left(\left(4 + \frac{4 c_3 k}{c_1 c_2}\right) \cdot m \cdot n\right) \\
&= O(K \cdot m \cdot n)
\end{align*}

Here $K$ is a non-trivial constant factor. This factor $K$ is what must be optimised for best performance.

3.4.3 Comparison

To properly compare the algorithms analysed in §3.4.1 and §3.4.2 we look at their constant factors. To do this we assign values to various constants which can be found in the code listings in Appendix B and C. We have for the SumThreshold method:

\begin{align*}
k &= 4r + 2rc + 3 \\
&= 4 \cdot 7 + 2 \cdot 7 \cdot c + 3 \\
&= 31 + 14c
\end{align*}

And for the variable window method:

\begin{align*}
K &= 4 + \frac{4 c_3 k}{c_1 c_2} \\
&= 4 + \frac{4 \cdot 3 \cdot c_3}{32 \cdot 128} \\
&= 4 + \frac{3 c_3}{1024}
\end{align*}

We can assume that the values $c$ and $c_3$ are comparable, as they are both constant factors which contain the focus of the code. Thus we can show the difference between $k$ and $K$ as:

\begin{align*}
K - k &= 4 + \frac{3c}{1024} - (31 + 14c) \\
&= -27 + c\left(\frac{3}{1024} - 14\right) \\
&= -27 - 13.997 \cdot c \\
K &= k - 27 - 13.997 \cdot c
\end{align*}

Since $c$ must be a positive value, it should come as no surprise that the variable window method runs significantly faster than the SumThreshold method.
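The arithmetic above can be checked numerically with a few lines of Python. The values r = 7, three repeats, and window dimensions 32 and 128 are those used in the derivation. The function names are hypothetical, and the sample values of c are arbitrary.

```python
def k_sumthreshold(c, r=7):
    """Constant factor of the SumThreshold method: 4r + 2rc + 3."""
    return 4 * r + 2 * r * c + 3

def K_variable_window(c3, repeats=3, c1=32, c2=128):
    """Constant factor of the variable window method: 4 + 4*c3*k / (c1*c2)."""
    return 4 + 4.0 * c3 * repeats / (c1 * c2)

# Arbitrary sample values for the per-window cost c (taking c3 = c).
for c in (1, 10, 100):
    diff = K_variable_window(c) - k_sumthreshold(c)
    print(c, k_sumthreshold(c), K_variable_window(c), diff)
```

For every positive c the difference is negative, in line with the closed form K - k = -27 - 13.997c.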

Chapter 4

Implementation

Here we discuss the implementation of the two algorithms chosen for development. These are then tested and validated with simulated data (Chapter 5), followed by case studies of how they react to real data (Chapter 6).

With regard to the design of the system, some aspects changed during the implementation process. The original design can be seen in Figure 4.1. The first algorithm, SumThreshold, incorporated a separate script to transpose the data file before inputting it to the algorithm. The second algorithm underwent major changes over the course of the implementation, as it is a more complex system. The original design of fitting the data to a surface was modified into a system which uses a smoothed surface and the standard deviation of a window to search out the larger and smaller RFI in different ways, whilst allowing for noise. Thus the shaded oval was added to Figure 4.1. More details on the implementation of each algorithm follow below.

4.1 Languages and libraries

The algorithms are all implemented in the Python programming language. This language was chosen as the developers at the SKA already work predominantly in Python, and there are many very powerful scientific libraries written for Python[6], such as the h5py library[20] which allows a Python script to read a file in the HDF5 file format. This is necessary since the astronomical data is all stored in HDF5 files, which compress the data to a storable size. Another library used extensively is Numpy[22], a library which allows advanced manipulation of arrays of data, making finding statistical values for a section of data simple.

4.2 SumThreshold Algorithm

This is the first algorithm implemented. A full description of the original algorithm can be found in Chapter 2. This algorithm is a combinatorial thresholding method, which means that, rather than only checking if every data point is above some threshold value χ, a window is moved across the data. Then, for every pass the sum of unflagged data points is compared to a lowered threshold value. This can be shown by the equations

\[ \mathrm{flag}^{\nu}_{M} \text{ if } \exists\, i \in \{0 \dots M-1\} : \forall\, j \in \{0 \dots M-1\} \; |R(\nu + (i-j)\Delta\nu, t)| > \chi_M \]

\[ \mathrm{flag}^{t}_{M} \text{ if } \exists\, i \in \{0 \dots M-1\} : \forall\, j \in \{0 \dots M-1\} \; |R(\nu, t + (i-j)\Delta t)| > \chi_M . \]


Figure 4.1: Diagram showing the structure of the detection and visualisation system.

4.2.1 Prototype 1

To begin, we created a crude implementation of the algorithm as described in Chapter 3. In this process, some issues were discovered, such as:

1. It can be tricky to decide on a suitable thresholding value (χ) above which all data points are flagged. We decided to use statistical relevance checks. So the χ value is set to be the median value increased by 5σ. This is then decreased with each pass to 3σ, which gives the lowered χ value for the combinatorial step. This was discovered to be necessary when performing validation tests with a smooth increasing surface: the surface was flagged as RFI when the slope was positive.

2. The algorithm proves to be unreliable on the edges of the data, an acknowledged issue in signal processing, as there is insufficient data around the specific points to get an accurate median value. This issue can only be solved by counting the fringe values as unreliable, and measuring a little wider than is required for measurement.

3. Part of the optimization of this algorithm is determining the initial window size, as well as the rate of growth and the number of passes to be made. It is unreasonable to begin with a window of size one, which checks every point, as this will slow the algorithm to worse than real time. To achieve real time, a single row of data should be processed in a second or less. This problem is considered in the optimisations listed below.


4. The original implementation took a long time to run.
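The threshold choice in item 1 can be sketched as follows. The text states only that χ starts at the median plus 5σ and is lowered to 3σ over the passes. The linear schedule used here is an assumption, as are the function and parameter names.

```python
import numpy as np

def chi_for_pass(data, pass_index, n_passes, start_sigmas=5.0, end_sigmas=3.0):
    """Threshold for a given pass: median + n*sigma, with n going 5 -> 3.

    The linear interpolation between 5 and 3 is an assumed schedule.
    """
    median = np.median(data)
    sigma = np.std(data)
    if n_passes > 1:
        frac = float(pass_index) / (n_passes - 1)
    else:
        frac = 0.0
    n_sigmas = start_sigmas + frac * (end_sigmas - start_sigmas)
    return median + n_sigmas * sigma

# Tiny made-up sample with median 1 and standard deviation 1.
sample = np.array([0.0, 2.0])
thresholds = [chi_for_pass(sample, p, 3) for p in range(3)]
```

Using the median rather than the mean keeps the baseline estimate from being pulled upwards by the RFI spikes themselves.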

4.2.2 Optimisation 1

The χ calculation was modified to be independent of the window size, which reduced the χ calculation to constant time. This reduces the complexity of the algorithm, and increases its speed. The second optimisation changed the step size of the window. As stepping through every point multiple times is inefficient, this was changed to begin with a step size of 6, which then increases with every pass so that larger windows have a larger step size. This optimisation cut running time down to below 30 minutes on average for data collected over one hour, which is half of real time. This also allows for a user to decide whether accuracy or time is more important. The step size and number of passes can be parametrized to allow a user to set their own values: a user looking for high accuracy will set the step size very low and the number of passes higher.

4.2.3 Optimisation 2

Further testing after optimisation 1 brought some glaring errors to light. Optimisation 1 was tested before the correct version of χ was used. Changing the value of χ meant that the subroutine for the combinatorial flagging in the window needed to be reviewed. The original method was:

flag window:
    for point in array:
        if point not flagged:
            add to sum
            increase counter
    if abs(sum) > abs(counter * chi):
        flag entire window

This does not work, as the majority of the data is negative, but there are some RFI spikes which are positive. To account for these negatives the algorithm was modified to

flag window:
    for point in array:
        if point not flagged:
            sum += (point - chi)
    if sum > 0:
        flag entire window

This gives the sum of the distances of the points from the threshold value. So points which are below the value will have a negative impact, and those which are above will have a positive impact on the sum. The main method was also modified to force the step size to be equal to the length of the window, ensuring that no points are ever missed.
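The difference between the two window tests can be seen on a small made-up example with negative data. The numbers below are illustrative only and are not taken from the RATTY data; the function names are hypothetical.

```python
def flag_original(points, chi):
    """Original test: compare |sum| with |count * chi|."""
    total = sum(points)
    return abs(total) > abs(len(points) * chi)

def flag_modified(points, chi):
    """Modified test: sum of signed distances from chi must be positive."""
    return sum(p - chi for p in points) > 0

chi = -75.0                              # illustrative threshold
quiet = [-80.0, -80.0, -80.0, -80.0]     # no RFI, all below chi
spike = [-80.0, -80.0, -40.0, -80.0]     # one value well above chi

print(flag_original(quiet, chi), flag_original(spike, chi))
print(flag_modified(quiet, chi), flag_modified(spike, chi))
```

With negative data the absolute values invert the comparison, so the original test flags the quiet window and misses the spike, whereas the modified test behaves as intended.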

4.3 Surface and dilation algorithm (discontinued)

This algorithm was originally going to be based on the AOFlagger model explained in Chapter 2 §2.4.4, which fits the data to some surface and then expands all the flagged areas based on the assumption that RFI will occur in larger regions than are actually detected.

Using spline interpolation it is possible to create a smoothed version of the data against which to perform checks. The pseudo code for this original algorithm is as follows:

Row-wise repeat:
    Do (at least twice):
        - Replace flagged data with median value
        - Create spline interpolated surfaces
        - Compare values between interpolations,
          flagging those beyond a certain level.
    end do
end repeat

This method, which interpolates every row of data, takes a very long time to run (around 2 days). This time is unacceptable, and the method does not give sufficient accuracy to warrant longer than real time processing.

After discovering that it takes about 20s to perform a spline interpolation on a single row of data, the algorithm was rethought a little. This involved preprocessing the data to act as a smoothed surface, which in itself takes a long time, but that one file can be used to process many different data sets. This improves the speed greatly, taking only a few minutes to perform a detection for an hour’s worth of data. Unfortunately the algorithm has moved away from the original idea, and is no longer very different from a basic thresholding algorithm. At this point we discarded this algorithm and moved to the variable window method.

4.4 Variable window algorithm

4.4.1 Prototype 1

A system of a fixed window size which ran through the entire surface was initially built. Some noteworthy errors were made during the implementation. The first error was that the window moved diagonally through the data, only looking at a band from the top left to the bottom right. The second thing that required some time to solve was the necessity of a standard deviation calculation with a predefined mean value. There is a standard method for doing this in Python 3, but not in Python 2. Porting the algorithm to Python 3 was considered, but the difference between Python 3 and Python 2 is sufficiently large that this became infeasible very quickly. So it was necessary to write a standard deviation method.
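A standard deviation about a predefined centre can be written in a few lines of NumPy. This is a sketch of the idea rather than the method in the appendix listings, and the function name is hypothetical. In Python 3.4 and later, `statistics.pstdev(data, mu)` from the standard library performs the same population calculation.

```python
import numpy as np

def std_about(values, centre):
    """Population standard deviation of values about a given centre.

    Unlike np.std, the centre is supplied by the caller (for example a
    value from a smoothed reference surface) rather than being the mean.
    """
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.mean((values - centre) ** 2))

window = [1.0, 3.0]
about_mean = std_about(window, 2.0)   # centre equals the sample mean
about_zero = std_about(window, 0.0)   # centre away from the sample mean
```

When the supplied centre differs from the sample mean the result is larger than `np.std`, because the offset of the window from the reference value is included in the spread.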

4.4.2 Optimisation

From the system with a fixed window, adding in the window variations was fairly simple. The system takes three steps to produce an accurate representation of the rate of change in the standard deviation, and uses an average over the last three steps to calculate this value. A look up is then used to determine the window size of the next step. The smallest window is 128 × 32, which is for a rate of change greater than 2. The middle size is 192 × 32, for a rate above 1. The largest size is 256 × 32 for all smaller rates of change. The process is repeated three times, which gives reasonable accuracy, and only requires about 20 minutes of processing time on data collected in one hour.
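The look up described above can be sketched as below. The thresholds and window sizes are those given in the text, with 32 taken to be the fixed time dimension as in §3.4.2. Averaging the absolute change between consecutive standard deviations is an assumption, as are the function names.

```python
def window_size(rate_of_change):
    """Return (frequency, time) window dimensions for the next step."""
    if rate_of_change > 2:
        return (128, 32)
    if rate_of_change > 1:
        return (192, 32)
    return (256, 32)

def mean_rate(sigmas):
    """Average absolute change in s.d. over the last three steps."""
    recent = sigmas[-4:]
    diffs = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Made-up history of standard deviations from successive windows.
history = [1.0, 2.0, 4.0, 7.0]
rate = mean_rate(history)
next_window = window_size(rate)
```

Only the frequency dimension changes, which is consistent with the note in the analysis that the time dimension never changes.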

4.5 Conclusions

The implementation necessitated adaptation of the original design, which allowed for better algorithms to be developed. These algorithms are based on the ideas used in the originals, but put the ideas together in a slightly different way which is more appropriate for the data being processed. Through this procedure, we ended up with two viable algorithms: the SumThreshold algorithm and the variable window algorithm. These algorithms were then thoroughly checked, as is discussed in the next chapter.

Chapter 5

Validation

Here we discuss validation of the two algorithms developed for RFI detection, the SumThreshold method and the variable window method. In running simulations we are able to accurately determine which RFI each algorithm is able to detect, and to what extent that RFI is detected. We are also able to determine the sensitivity of each algorithm, and the accuracy in the flagging, which will give us a feel for when we can expect false positives from the algorithm. The tests discussed in this chapter contain the most important information found through the simulations. Further test results are provided in Appendix A.

5.1 Methods

To check that the output is correct, we create specific test data containing values similar to the real input data, containing RFI signals in known positions. This is done through generating Gaussian noise with fake RFI signals added in known places. If the implementation correctly flags this data, it can be considered to be working correctly.

We made use of spline fitting and medians to smooth data. This gives a realistic smooth surface which can be used for testing, as the real data is based on such a shape.

The method of smoothing the data was as follows:

Run a window across all data, finding the median.

Create a data file containing these median values.

Perform a Bivariate Spline on the data file, smoothing value: s=0.5

Save the new spline surface as the smoothed data surface.

On top of this smoothed surface, white noise is added, which emulates astronomical data, which is often treated as Gaussian noise[12].

We create a different type of surface to test methods on as well. This is done with a perfectly flat surface, where the baseline of the values is set to zero. We then set specific values to be RFI spikes which should be picked up by the detection method.
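A flat test surface of this kind might be generated as in the sketch below. The array size, noise level, spike height and channel index are arbitrary illustrative values, and the function name is hypothetical. The returned `truth` array records where RFI was seeded so that a mask can be compared against it.

```python
import numpy as np

def make_flat_test_data(n_times=100, n_channels=200, noise_sigma=1.0,
                        rfi_channels=(50,), rfi_level=20.0, seed=0):
    """Zero-baseline Gaussian noise with RFI seeded in known channels.

    Rows are time steps and columns are frequency channels.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, noise_sigma, size=(n_times, n_channels))
    truth = np.zeros(data.shape, dtype=bool)
    for channel in rfi_channels:
        data[:, channel] += rfi_level   # narrow band RFI in one channel
        truth[:, channel] = True
    return data, truth

data, truth = make_flat_test_data()
```

Comparing a detection mask with `truth` then gives the counts of false positives and missed detections directly.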

5.1.1 Determining success

To determine success we will first compare the results of running each algorithm over time and frequency separately, and then compare the results of the algorithms to each other. We will also compare each algorithm to a kurtosis algorithm (supplied). We will then decide if the differences in performance allow for the combination of the algorithms to create a better method, and include characterisation of the signals.

5.2 Tests

Figure 5.1: a) Data in the general shape of real data, but with RFI removed, and noise added. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Test one tested the algorithms on a non-uniform surface with no RFI. This was done with data in the same shape as the real data, but which was smoothed and then had noise added, as can be seen in Fig 5.1a. The expected outcome for this test was two perfectly empty masks. The SumThreshold method provided exactly that (Fig 5.1b), but the variable window method has flagged areas of the data (Fig 5.1c).

The bands marked 2 and 3 in Figure 5.1c correspond to points where the data steps steeply, suggesting that there is a weakness in the variable window method when the data steps. This leads to the inclusion of false positives in the mask in these areas. This means that the method should be validated either with another detection algorithm, or by observing the data. The bands marked 1 and 4 have less obvious causes, but the cause is similar. They are both on a steep upward slope of the data, and so the algorithm is very sensitive to this type of change.

Test two is designed to test the sensitivity of both algorithms to narrow band, isolated RFI. This is done by using the same surface as in test one, with a single frequency channel including RFI. This is shown in Figure 5.2a, at the point labelled RFI.


Figure 5.2: a) Data in the general shape of real data, with a single RFI spike, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Figure 5.2 shows the increased sensitivity of the variable window method, as the single spike is flagged in a very narrow band, whereas the SumThreshold method picks it up with a wider band. This is because the SumThreshold method will be unable to pick up the spike until its window has expanded to a size larger than the width of the spike, and the entire spike is enclosed by the window. This shows that it would be possible to increase the accuracy of the SumThreshold method by decreasing the step size of the window position, although this would also increase the run time.

The third test makes use of the second surface. Data with a baseline of zero, and Gaussian noise is tested. The expected outcome of this test is that neither algorithm will flag any data points.

Figure 5.3 shows that this test produced the expected results. Both masks are completely empty. This is a good thing, as it means that the algorithms are checking only for signals which differ from the median value by a statistically significant amount. This is relevant as one of the earliest iterations of the development did not have this property, and would have found RFI in this surface.

The fourth test is designed to test the sensitivity of the algorithms to a group of spikes close together. This stands the danger of being treated as noise with a very high standard deviation by the algorithms. We expect that the SumThreshold method will flag the entire band in which the group is found, and the variable window method will flag the individual spikes.

Figure 5.3: a) Data with a baseline of zero, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Figure 5.4: a) Data with a baseline of zero, a family of spikes, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Figure 5.4b shows that the SumThreshold method did not flag any values. This means that the algorithm falls into the trap of treating a group of spikes as noise with a very high standard deviation. The variable window method, however, acts as expected, flagging something in the same place as the RFI. A zoomed in view (Fig 5.5) shows that the variable window in fact flagged exactly the RFI, and so created a distinctive striped pattern.

Figure 5.5: a) A zoomed view of the family of spikes. b) A zoomed view of the stripes displayed by the variable window mask.

Figure 5.6: a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

The final test is designed to test the sensitivity of the algorithms in the time dimension, as there is some RFI which is visible only in the time domain. To perform this test the surface resembling real data is used, and three rows are seeded with RFI, uniformly along the row. This produces the three horizontal lines marked RFI in Figure 5.6a.

Figure 5.7: The SumThreshold mask searching for transient RFI

Figure 5.8: A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.

We expect that the SumThreshold algorithm will be unable to detect these lines because it processes the data set one row at a time. However, we shall test the performance of the SumThreshold method on a transpose of the dataset, and expect to see that the RFI is detected as it is narrow band in the time domain. We expect also that the variable window method will successfully flag the three lines.

We see the results of this test in Figure 5.6. As expected, there is nothing flagged in Fig 5.6b, which is the SumThreshold mask. We can see also that the variable window method performed almost as expected. There are three lines marked RFI in Fig 5.6c. However, these lines have gaps in them, which seem to be related to the false positive band just to their left.

Figure 5.7 shows the results of running the SumThreshold algorithm on the transposed data. We can see that it performed as expected, flagging the three lines accurately. We can see also in Figure 5.8 that the two masks that detected the horizontal lines detected them in the same place.
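A combined mask such as the one in Figure 5.8 can be formed with an element-wise OR once the transposed mask has been transposed back. The sketch below assumes the masks are 0/1 arrays with a row per time step and a column per frequency channel (Section 3.3). The function name and the small arrays are illustrative only.

```python
import numpy as np

def combine_masks(freq_mask, transposed_mask, window_mask):
    """OR together three 0/1 masks; transposed_mask has shape (freq, time)."""
    combined = (freq_mask.astype(bool)
                | transposed_mask.T.astype(bool)
                | window_mask.astype(bool))
    return combined.astype(np.uint8)     # back to the 0/1 convention

# Made-up masks for 2 time steps and 3 frequency channels.
freq_mask = np.array([[0, 1, 0],
                      [0, 0, 0]])
transposed_mask = np.array([[0, 0],
                            [0, 0],
                            [0, 1]])     # from the run on transposed data
window_mask = np.array([[1, 0, 0],
                        [0, 0, 0]])
combined = combine_masks(freq_mask, transposed_mask, window_mask)
```

A point is flagged in the combined mask if any one of the three masks flags it, so the false positives of each method are carried over as well.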

5.3 Discussion

We can see through this validation process that both the SumThreshold and the variable window methods have some shortcomings. The SumThreshold method is not as sensitive as it is expected to be, and does not deal appropriately with groups of RFI. The variable window method is perhaps too sensitive, as it is liable to show false positives when the data increases steeply.

The overall sensitivity of the variable window method makes it a very good first method to use for a broad understanding of where there is RFI in the data, but a second method should be used to validate the broader bands of flagged data, as this is where the false positives appear. The two methods are both capable of finding RFI in both the time domain and the frequency domain, although the variable window method does so more efficiently. The SumThreshold method, while requiring that the algorithm is run on the transpose of the data, does find the time based RFI as accurately as it finds frequency based RFI.

These tests reveal some of the characteristics of the RFI detected by the two algorithms. The SumThreshold algorithm detects predominantly isolated, narrowband RFI. The variable window method is very sensitive, and is able to detect almost any RFI, but has some false positives which could be mistaken for broadband RFI. This can be used when combating the source of the RFI. Knowing the type of the RFI being detected is helpful in narrowing down the possible sources.

Chapter 6

Results

In this chapter we discuss some case studies for the two RFI detection algorithms developed, to determine the qualitative difference in the algorithms. Each case highlights some feature or difference in the algorithms. The first study is a standard case, with very few features in the time dimension and standard features in the frequency dimension. The next study adds RFI in the time dimension. The third study is a fairly arbitrary choice of data, to show interesting effects. All three case studies are taken from real data collected on the site of the MeerKAT telescope, and made available by the SKA offices. We then compare performance of the two algorithms, and relate this back to the analysis in Chapter 3 §3.4.

6.1 Case Study 1

In this instance we consider a fairly typical data set (Figure 6.1aa). There are no major discrepancies in the time domain, and the RFI seen in the frequency domain is present in most of the other data sets as well. The first thing to note is that there are some lines on the data image which are clearly RFI. One set of such lines is marked on the figure. These lines show up as much darker than the rest of the image as they have a higher intensity. There are also some broad bands where the intensity increases; these are not broadband RFI, but rather “baseline wiggles”, also marked on the figure. These should not be flagged, as they correspond to trends in the baseline noise, rather than unusual occurrences.

The second mask (Figure 6.1ac), produced by the variable window algorithm, contains more flagged points. The first bar of masked points in the variable window mask (labelled 1) is not shared by the SumThreshold mask, and is also not visible in the data. The variable window method does create false positives under certain circumstances (see Chapter 5), and it is possible that this data includes those.

There follow after that some faint lines, marked 2. There are more of these lines on the variable window mask (Fig 6.1ac), but there are a few on the SumThreshold mask (Fig 6.1ab) as well. Those which are on both masks are tall thin isolated spikes. These are picked up very effectively by both algorithms, and can be used as a type of characterisation, as we know that if both methods flag the spike it must be a narrow and isolated type of RFI.

(a) a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm

(b) The SumThreshold mask searching for transient RFI

(c) A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.

Figure 6.1: An ordinary data set with typical RFI in the frequency domain, and minimal RFI in the time domain, along with the masks produced by the algorithms designed.

The SumThreshold algorithm cannot detect a family of spikes (marked 3), because the system sees them as simply noise with a very high standard deviation. However, the variable window method is able to pick them up, as a specific type of RFI, leaving a distinctive pattern of stripes. This is one of the greatest failings of the SumThreshold method in its current form. A better version would be able to detect that the group of frequencies contain RFI, even if the entire group must be flagged, not leaving the stripes, but rather a wide block.

The next point on the masks which draws attention is a dark stripe in the variable window mask which is entirely missing in the SumThreshold mask, and which shows only very faintly in the data (marked 4). This is either another false positive, or it shows the sensitivity of the variable window method, in that it is able to accurately pick up something which is so close to being hidden in the noise. We then come to the area of the data labelled as a “baseline wiggle”. It is encouraging to see that neither algorithm has flagged anything in this region, showing that both distinguish between RFI and signal fluctuations.

We end on a high note, where there is a line that looks as if it runs through the entirety of Fig 6.1a (marked 5). This is a line of RFI which has been picked up equally by both methods, and is clearly a strong source of RFI.

Overall this case study shows that the SumThreshold method is able to accurately flag isolated narrow-band RFI, whereas the variable window method is far more sensitive, and is able to flag wider band RFI, as well as groups of spikes, creating a striped pattern in the mask.
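The effect of a family of spikes on a threshold of this kind can be reproduced with a toy example. The sketch below is illustrative only: the window contents, the multiplier of 3, and the helper name `flag_above_threshold` are assumptions chosen for the demonstration, and it is not the implementation used for the figures. It applies a median-plus-k-sigma threshold, with the deviation measured about the median as in Appendix B, to a window with one isolated spike and to a window in which every second channel carries a spike.

```python
import numpy as np


def flag_above_threshold(window, k=3.0):
    """Flag samples above median + k * (RMS deviation about the median)."""
    med = np.median(window)
    sdev = np.sqrt(np.mean((window - med) ** 2))
    return window > med + k * sdev


# Hypothetical 100-channel windows on a flat, noise-free baseline.
isolated = np.zeros(100)
isolated[40] = 10.0          # one narrow spike

family = np.zeros(100)
family[::2] = 10.0           # a spike in every second channel

isolated_flags = flag_above_threshold(isolated)
family_flags = flag_above_threshold(family)

print(int(isolated_flags.sum()))  # 1: the single spike is flagged
print(int(family_flags.sum()))    # 0: the family raises its own threshold
```

In the second window the spikes pull the median up to 5 and the deviation up to 5, so the threshold (20) sits above every spike. This is the sense in which a family of spikes looks like noise with a very high standard deviation.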

6.2 Case Study 2

This study (Figure 6.2a) illustrates the sensitivity of the two algorithms to transient RFI, which is RFI in the time domain. The data set has two time steps in the last quarter (marked t) where all signals received were of a much higher intensity than usual, creating two dark horizontal lines in the data. The SumThreshold method is unable to detect this change, as it processes the data one time step at a time. However, these same horizontal lines are evident in the variable window mask (Fig 6.2ac). When the SumThreshold method is run on a transpose of the data matrix, it is able to find these two lines, as seen in Figure 6.2b. Figure 6.2c shows that the two methods have found roughly the same transient RFI. While the variable window method remains more sensitive overall, there are some marks which have been picked up by the SumThreshold method and not by the variable window method. These are lines which were not sufficiently smoothed out of the underlying surface used as a reference by the variable window method.

Comparing Figure 6.2aa with Figure 6.1aa shows that there are many similar features in the two data sets, which is expected given the choice of data for the first case study. The first difference is a very dark, uneven band (marked 1 on Fig 6.2a). This band corresponds to a darker band in the data, but what is important to note is its uneven nature. This unevenness is caused by changes in the signal over time: at different times, the signal takes up a different number of bands in the frequency spectrum. These differences are interesting in terms of the characterisation of the RFI.

In the middle of the data both masks have a line that stops about a third of the way down (marked 2). This is an example of transient RFI that can be picked up by both systems, as it is also flagged as RFI in the frequency domain. There are a few other such partial lines in the variable window mask that are also picked up by the SumThreshold method, although the lines are slightly shorter in the SumThreshold mask.

Overall, the variable window method is able to detect transient RFI (in the time domain) more easily and effectively than the SumThreshold method. The SumThreshold method is able to detect lines of transient RFI only when the data is transposed before the algorithm is applied.
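The idea of running a row-wise detector on both the data and its transpose, and then combining the masks, can be sketched as follows. This is a minimal illustration under stated assumptions: `flag_rows` is a simplified stand-in detector (a per-row median-plus-k-sigma threshold), not the SumThreshold implementation, and the small array is hypothetical data.

```python
import numpy as np


def flag_rows(data, k=3.0):
    """Stand-in row-wise detector: flag samples above each row's
    median + k * (RMS deviation about that median)."""
    med = np.median(data, axis=1, keepdims=True)
    sdev = np.sqrt(np.mean((data - med) ** 2, axis=1, keepdims=True))
    return data > med + k * sdev


# Hypothetical data: rows are time steps, columns are frequency channels.
data = np.zeros((50, 40))
data[:, 7] += 10.0    # persistent narrow-band RFI in one channel
data[30, :] += 10.0   # transient RFI: one time step raised in every channel

freq_mask = flag_rows(data)        # each time step scanned along frequency
time_mask = flag_rows(data.T).T    # each channel scanned along time
combined = freq_mask | time_mask   # union of the two masks
```

Scanning along frequency flags channel 7 but misses the raised time step, because within that row every channel is equally high. Scanning the transpose finds the time step, and the union of the two masks contains both features.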





Figure 6.2: A data set with typical RFI in the frequency domain, and two lines of RFI in the time domain, along with the masks produced by the algorithms designed.

Figure 6.3: An arbitrary data set which shows the necessity of the detection algorithms to see all the RFI within the data, along with the masks produced by the algorithms designed.


6.3 Case Study 3

The third study is a data set which has properties of both the previous studies. There are more RFI points in the frequency domain, but these do not cover the full range of the time domain. This is where the RFI detection algorithms become necessary, because the data looks very similar to that in the first and second studies, but the masks are noticeably different.

The first difference that should be noted is that there is more RFI flagged on both masks. Even though we cannot see that RFI clearly in the data, the fact that both methods have picked up more, and in the same places, is strong evidence that the spikes are really there. This is an important observation, as it makes the need for these algorithms clear. If only visible RFI were detected, there would be no need for an automated system, as we could easily tell which channels to ignore when using the collected data for science.

The next thing we notice is that in the first quarter of the variable window mask, there is a small black square (marked 1 in Figure 6.3a). This is an unusual occurrence, as it suggests that either there was a glitch in the software (unlikely, as this was not seen in the validation) or there is a very square area of RFI in that particular spot. The perfect square suggests that a single window was completely flagged at that point, perhaps because a drop in the standard deviation meant that more of the window was flagged than would have been expected.

Figure 6.3b shows what transients (RFI in the time domain) have been detected by the SumThreshold method. These are different from the transients detected by the variable window method, but when we look at Figure 6.3c we can see that they are definitely related. Thus combining the masks in this way is useful, particularly when otherwise the flags are almost invisible in the mask.

This data set gives a good idea of the usefulness of the detection algorithms. There are some transients in the data, as well as many unusual spikes in the frequency domain. The masks show that the detection algorithms are far more sensitive than an observer would be.

6.4 Profiling

On average, it takes 20 minutes to run the variable window method (depending on CPU speed) on a data set with 3600 rows and 14200 columns, which is the data collected on the MeerKAT site in one hour. Each pass through the data takes less than 400 seconds, which is less than 6.7 minutes. On the same data it takes around 5.5 hours to run a single instance of the SumThreshold method, at around 5 seconds per row; this also depends on how many other processes are using memory. If the system has to use swap (virtual memory), the process takes up to two hours longer due to the longer write time. These timings were measured on an Intel i3-3120M CPU at 2.50GHz; a faster processor gives better times. The SumThreshold run time could be improved by using parallel processing on GPUs, because the method processes a single row of data at a time, and each row has exactly the same operations performed on it. If the SumThreshold method were parallelised appropriately, its speed could be increased significantly.
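The quoted timings can be cross-checked with some simple arithmetic. The sketch below uses only the figures given above, plus two labelled assumptions: that the variable window method makes three passes (the listing in Appendix C loops three times), and a hypothetical count of 100 parallel workers with perfect scaling.

```python
rows = 3600                        # time steps in one hour of data

# Variable window method: three passes, each bounded by 400 seconds.
passes = 3
seconds_per_pass = 400
variable_window_minutes = passes * seconds_per_pass / 60

# SumThreshold: one row at a time, at about 5 seconds per row.
seconds_per_row = 5
sum_threshold_hours = rows * seconds_per_row / 3600

# Rows are independent, so with perfect scaling over `workers` parallel
# workers (a hypothetical figure) the row loop shrinks proportionally.
workers = 100
ideal_parallel_minutes = rows * seconds_per_row / workers / 60

print(variable_window_minutes)  # 20.0
print(sum_threshold_hours)      # 5.0
print(ideal_parallel_minutes)   # 3.0
```

The per-row figure gives about 5 hours, in reasonable agreement with the measured 5.5 hours. The parallel estimate is an upper bound on the benefit, since it ignores the cost of moving data to and from the workers.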

The two methods also have different memory requirements. The SumThreshold algorithm requires only 800MB of RAM to perform optimally, whereas the variable window algorithm requires at least 1.2GB. This is because the variable window method must load one more file into memory than the SumThreshold method. This extra file is the RFI and noise free surface to which the data is being fitted.
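These memory figures are consistent with a simple count of the full-size arrays each method holds. The sketch below assumes 64-bit (8-byte) samples, which is not stated in the text, and assumes the mask has the same type as the data, as it would when created with `np.zeros_like` as in the listings in Appendices B and C.

```python
rows, cols = 3600, 14200
bytes_per_sample = 8     # assumption: 64-bit floating-point samples

array_mb = rows * cols * bytes_per_sample / 1e6

sum_threshold_mb = 2 * array_mb      # data + mask
variable_window_mb = 3 * array_mb    # data + mask + smooth surface

print(round(array_mb))            # 409
print(round(sum_threshold_mb))    # 818
print(round(variable_window_mb))  # 1227
```

Under that assumption, two arrays come to roughly 800MB and three arrays to roughly 1.2GB.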

6.5 Discussion

From these three case studies, and the previous chapter (5) on validation, we can see that the variable window method is much more sensitive, and much more efficient, than the SumThreshold method, but the two methods do detect different things. In general, for broad detection of RFI, it is best to use the variable window method, and then, if false positives are a concern, to use a second method as well.

We have also seen through a thorough analysis that the variable window method is expected to run significantly faster than the SumThreshold method. This is borne out by experiment, which shows the variable window method can run in around 20 minutes, whilst the SumThreshold method takes over 5 hours.

Chapter 7

Conclusions and Future Work

This work developed and compared two RFI detection algorithms: the SumThreshold algorithm, and the variable window algorithm. Qualitatively, the variable window detection method developed for this project is more sensitive and more time efficient than the SumThreshold method, which was adapted from the LOFAR telescope in the Netherlands. The variable window method is able to detect far more RFI than the SumThreshold method, as well as more accurately flagging that which it does find. The SumThreshold method, however, has the advantage that it does not require an “expected” surface with which the actual data should be compared.

A thorough algorithmic analysis of the two methods shows that both run in a time which is linear in the input size, but with a non-trivial constant factor. This constant factor is what must be optimised to achieve the best possible performance.

This project successfully adapted two algorithms to efficiently flag RFI, and included some characterisation of said RFI. An efficient algorithm has been produced based on previous ideas, and it has been seen that through using a number of different detection methods we can differentiate between the different types of RFI in the system.

Through the validation process a few new interesting issues were discovered. However, there was insufficient time to completely address these, and so possible fixes are left as future work.

There is a difficulty in finding low intensity broadband RFI. This is caused by the window not being sufficiently wider than the section in which the RFI is found. This means that either the median value which is used in the threshold is too high, or the standard deviation is too high. A possible method to fix this would be to ‘fit’ a baseline through a window, by assuming that the window is centred on the RFI, and then interpolating a line through the middle third (or some more appropriate amount) of the window, based on the median values of the outer thirds. The comparison could then be done between the median of the middle third and this interpolated baseline. This method would also be helpful in finding “families” of spikes, which are high RFI in many channels, close together, but not in all. This RFI looks like noise with a very high standard deviation.
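The proposed baseline fit can be sketched in a few lines. This is one possible reading of the idea, not a tested detector: the function name `middle_third_excess`, the choice to anchor the line at the centre of each outer third, and the example window are all assumptions made for illustration, and no decision threshold is implied.

```python
import numpy as np


def middle_third_excess(window):
    """Median of the middle third minus a baseline interpolated
    linearly between the medians of the two outer thirds."""
    n = len(window)
    third = n // 3
    left = window[:third]
    middle = window[third:n - third]
    right = window[n - third:]

    # Anchor the line at the centre index of each outer third.
    x_left = (third - 1) / 2.0
    x_right = n - third + (third - 1) / 2.0
    x_mid = (third + (n - third) - 1) / 2.0   # centre of the middle third

    baseline = np.interp(x_mid, [x_left, x_right],
                         [np.median(left), np.median(right)])
    return np.median(middle) - baseline


# Hypothetical window: a sloping baseline with weak broadband RFI
# raising the middle third by 0.5.
x = np.arange(90, dtype=float)
clean = 0.01 * x
window = clean.copy()
window[30:60] += 0.5

excess = middle_third_excess(window)
clean_excess = middle_third_excess(clean)
```

For this window the excess recovers the 0.5 offset, while the clean sloping baseline gives an excess of essentially zero, so the slope itself is not mistaken for RFI.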

There are other methods which could be explored for this detection process. Future work could include exploring these alternative options for more characterisation opportunities.


Bibliography

[1] J. Antoni, “The spectral kurtosis: a useful tool for characterising non-stationarysignals,” Mechanical Systems and Signal Processing, vol. 20, pp. 282–307, 2004.

[2] I. Asimov, Eyes on the Universe. Andre Deutsch, 1975.

[3] P. Bolli, F. Gaudiomonte, F. Messina, R. Ambrosini, C. Bortolotti, and M. Roma,“The RFI monitoring systems for the Medicina and the Sardinia radio telescopes,”in PoS RFI2010, 2010, p. 29.

[4] A. Boonstra and S. Van der Tol, “Spatial filtering of interfering signals at the initiallow frequency array (lofar) phased array test station,” Radio science, vol. 40, no. 5,2005.

[5] R. Ekers and J. Bell, “Radio frequency interference,” arXiv preprint astro-ph/0002515, 2000.

[6] P. S. Foundation, “Python.org,” http://www.python.org//, 2014, accessed on Octo-ber 28, 2014.

[7] D. Hanekom, “General notices, notice 198 of 2014, government gazette,” 2014.

[8] ICASA, “Draft South African table of frequency allocations,” 2008.

[9] J. Kocz, F. Briggs, and J. Reynolds, “Radio frequency interference removal throughthe application of spatial filtering techniques on the parkes multibeam receiver,” TheAstronomical Journal, vol. 140, no. 6, pp. 2086–2094, 2010.

[10] F. J. Meyer, J. B. Nicoll, and A. P. Doulgeris, “Correction and characterizationof radio frequency interference signatures in l-band synthetic aperture radar data,”IEEE Transactions on Geoscience and Remote Sensing, vol. 51, pp. 4961–4972, 2013.

[11] G. M. Nita and D. E. Gary, “Statistics of the SK estimator,” in PoS RFI2010, 2010,p. 19.

[12] G. M. Nita, D. E. Gary, Z. Liu, G. J. Hurford, and S. M. White, “Radio frequencyinterference excision using spectral-domain statistics,” Publications of the Astronom-ical Society of the Pacific, vol. 119, no. 857, pp. 805–827, 2007.

[13] A. R. Offringa, A. G. de Bruyn, S. Zaroubi, and M. Biehl, “A LOFAR detectionpipeline and its first results,” in PoS RFI2010, 2010, p. 36.

[14] A. Offringa, A. de Bruyn, M. Biehl, S. Zaroubi, G. Bernardi, and V. Pandey, “Post-correlation radio frequency interference classification methods,” Monthly Notices ofthe Royal Astronomical Society, vol. 405, no. 1, pp. 155–167, 2010.

43

http://www.python.org//

44 BIBLIOGRAPHY

[15] A. Offringa, A. de Bruyn, and S. Zaroubi, “Post-correlation filtering techniques foroff-axis source and rfi removal,” Monthly Notices of the Royal Astronomical Society,vol. 422, no. 1, pp. 563–580, 2012.

[16] A. Offringa, A. de Bruyn, S. Zaroubi, G. van Diepen, O. Martinez-Ruby, P. Labropou-los, M. Brentjens, B. Ciardi, S. Daiboo, G. Harker et al., “The lofar radio environ-ment.” Astronomy & Astrophysics/Astronomie et Astrophysique, vol. 549, 2013.

[17] A. Offringa, J. Van de Gronde, and J. Roerdink, “A morphological algo-rithm for improving radio-frequency interference detection.” Astronomy & Astro-physics/Astronomie et Astrophysique, vol. 539, p. A95, 2012.

[18] R. Oliva, E. Daganzo, Y. H. Kerr, S. Mecklenburg, S. Nieto, P. Richaume, andC. Gruhier, “Smos radio frequency interference scenario: Status and actions takento improve the rfi environment in the 1400–1427-mhz passive band,” Geoscience andRemote Sensing, IEEE Transactions on, vol. 50, no. 5, pp. 1427–1439, 2012.

[19] Unknown, “Electromagnetic interference,” en.wikipedia.org/wiki/Electromagneticinterference, 2014, accessed on April 14, 2014.

[20] ——, “H5py.org,” http://www.h5py.org//, 2014, accessed on October 15, 2014.

[21] ——, “The history of the SKA project,” https://www.skatelescope.org/project/history-of-the-skaproject/, 2014, accessed on April 28, 2014.

[22] ——, “Numpy.org,” http://www.numpy.org/, 2014, accessed on October 15, 2014.

[23] ——, “Nyquist frequency,” en.wikipedia.org/wiki/Nyquist frequency, 2014, accessedon May 9, 2014.

[24] ——, “Square Kilometre Array (SKA) Africa,” www.ska.ac.za/index.php, 2014,accessed on April 28, 2014.

en.wikipedia.org/wiki/Electromagnetic_interference

en.wikipedia.org/wiki/Electromagnetic_interference

http://www.h5py.org//

https://www.skatelescope.org/project/history-of-the-skaproject/

https://www.skatelescope.org/project/history-of-the-skaproject/

http://www.numpy.org/

en.wikipedia.org/wiki/Nyquist_frequency

www.ska.ac.za/index.php

Appendices


Appendix A

Validation Results

Figure A.1: a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

Figure A.2: a) Data with a baseline of zero, very low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

Figure A.3: a) Data with a baseline of zero, low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

Figure A.4: a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

Figure A.5: a) Data with a baseline of zero, and one narrow spike. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The spike is accurately flagged by both methods.

Appendix B

SumThreshold

#! /usr/bin/python
import numpy as np
import h5py
import sys
import time
import math

#===============================================================#
#                                                               #
#                         SumThreshold                          #
#                     By Philippa Hillebrand                    #
#                                                               #
#===============================================================#

#calculate the standard deviation with a given mean (xbar) value
#there is a native thing in python3.3, but python3.3 breaks things
def stdev(array, xbar):
    var = 0
    count = 0
    for i in range(array.shape[0]):
        var += (array[i]-xbar)**2
        count += 1
    #end for i
    var /= count
    std = math.sqrt(var)
    return std
#end stdev

#A method to calculate the chi value for a window. Essentially this
# calculates the median of a window with length 'length', and the
# standard deviation of that window about the median.
def setChi(pos, array, length, run):
    if(pos < length//2):
        #window would run off the start: use the first 'length' points
        shortR = array[0:length]
    elif(pos + length//2 + length%2 > array.size):
        #window would run off the end: use the last 'length' points
        shortR = array[array.size - length:array.size]
    else:
        shortR = array[pos-length//2:pos+length//2 + length%2]
    med = np.median(shortR)
    sdev = stdev(shortR, med)
    #med+= sdev
    return (med,sdev)
#end setChi

#A method to check through a designated window for RFI, using the
# combinatorial method.
def windowFunction(pos, chi, length, array, mask, med):
    z = 0
    for i in range(pos - length//2 - length%2, pos + length//2):
        if mask[i] == 0:
            #Add the distance from each point to the threshold.
            #below is negative, above is positive
            z += (array[i] - chi)
        #end if
    #end for i
    if z > 0:
        #Flag this entire window, and replace the flagged data with the
        #median so that it does not affect later, larger windows
        for i in range(pos - length//2 - length%2, pos + length//2):
            mask[i] = 1
            array[i] = med
        #end for i
    #end if
    return mask
#end windowFunction

def main():
    if len(sys.argv) <= 1:
        print("Include a file as an argument")
        sys.exit(1)
    #end if
    fileName = sys.argv[1]
    stf = '_'.join([fileName,'mask.h5'])
    data = h5py.File(fileName, 'r')
    mask = h5py.File(stf, 'w')
    #Open the data and load into memory
    dset = data['spectra']
    arr = dset[...]
    #Initialise mask array to zeros (nothing flagged).
    maskSet = np.zeros_like(arr)
    totalLength = arr.shape[1]
    maxRuns = 7
    #Loop through the rows
    for i in range(arr.shape[0]):
        #Set window size and position
        #initially
        run = 0
        length = 2
        growthMult = 2
        pos = length//2 + length % 2
        chi = (0,0)
        jumpSize = 2
        #Which line of data am I working with
        arri = arr[i,:]
        while run < maxRuns:
            #print run, jumpSize, length
            while pos + length//2 <= totalLength:
                #Set threshold value (chi)
                if length < 100:
                    chi = setChi(pos, arri, 100, run)
                else:
                    chi = setChi(pos, arri, length, run)
                #Call detection function for window
                maskSet[i,:] = windowFunction(
                    pos, chi[0] + (3+2//(run+1))*chi[1],
                    length, arri, maskSet[i,:], chi[0])
                pos += jumpSize
            #end while
            length *= growthMult
            jumpSize = length//2
            pos = length//2 + length % 2
            run += 1
        #end while
    #end for i
    #Write the mask and close data files.
    mset = mask.create_dataset("MaskData", data=maskSet)
    data.close()
    mask.close()
#end main

if __name__ == "__main__":
    main()
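The window lengths and threshold multipliers that the main loop of this listing steps through can be tabulated with a short script. The sketch below only re-traces the arithmetic of the listing; it takes the division in the multiplier to be integer division, and the name `schedule` is introduced here for illustration.

```python
max_runs = 7
growth_mult = 2

schedule = []
length = 2
for run in range(max_runs):
    chi_window = max(length, 100)     # setChi is given at least 100 samples
    multiplier = 3 + 2 // (run + 1)   # integer division, as in the listing
    schedule.append((run, length, chi_window, multiplier))
    length *= growth_mult

for entry in schedule:
    print(entry)  # (run, window length, chi window, sigma multiplier)
```

The detection window doubles from 2 to 128 samples over the seven runs, while the threshold drops from 5 to 4 and then stays at 3 standard deviations above the median.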

Appendix C

Variable Window

#! /usr/bin/python
import numpy as np
import h5py
import sys
import math
import time

#=======================================================#
#                                                       #
#                A variable window size                 #
#                                                       #
#=======================================================#

#calculate the standard deviation with a given mean (xbar) value
#there is a native thing in python3.3, but python3.3 breaks things
def stdev(array, xbar):
    var = 0
    count = 0
    for i in range(array.shape[0]):
        for j in range(array.shape[1]):
            var += (array[i,j]-xbar)**2
            count += 1
        #end for j
    #end for i
    var /= count
    std = math.sqrt(var)
    return std
#end stdev

#calculate the sigma values for this specific data.
def calculateSigma(winD, winP, data, smooth):
    smoothR = smooth[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
                     winP[1]-winD[1]//2:winP[1]+winD[1]//2]
    dataR = data[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
                 winP[1]-winD[1]//2:winP[1]+winD[1]//2]
    xbar = np.median(smoothR)
    sigma = stdev(dataR, xbar)
    return sigma
#end calculateSigma

def flagWindow(winD, winP, data, smooth, sigma, mask):
    smoothR = smooth[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
                     winP[1]-winD[1]//2:winP[1]+winD[1]//2]
    dataR = data[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
                 winP[1]-winD[1]//2:winP[1]+winD[1]//2]
    maskR = mask[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
                 winP[1]-winD[1]//2:winP[1]+winD[1]//2]
    xbar = np.median(smoothR)
    for i in range(dataR.shape[0]):
        for j in range(dataR.shape[1]):
            if sigma > 0:
                #flag points more than 5 sigma above the smooth median,
                #and replace them with the smooth surface value
                if dataR[i,j] - xbar > 5*sigma:
                    maskR[i,j] = 1
                    dataR[i,j] = smoothR[i,j]
                #end if
            else: print('sigma too low %s' % sigma)
        #end for j
    #end for i
    mask[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
         winP[1]-winD[1]//2:winP[1]+winD[1]//2] = maskR
    data[winP[0]-winD[0]//2:winP[0]+winD[0]//2,
         winP[1]-winD[1]//2:winP[1]+winD[1]//2] = dataR
#end flagWindow

def varyWindow(prevSize, deltaSigma):
    if deltaSigma >= 2:
        newSize = (prevSize[0], 128)
    elif deltaSigma >= 1:
        #[assignment missing from the source listing]
    else :
        #[assignment missing from the source listing]
    #end if
    return newSize
#end varyWindow

def main():
    if len(sys.argv) < 3:
        print('Include data and smoothing')
        sys.exit(1)
    #end if
    dfn = sys.argv[1]
    sfn = sys.argv[2]
    dataFile = h5py.File(dfn, 'r')
    smoothFile = h5py.File(sfn, 'r')
    maskFile = h5py.File('varWin'+dfn+'_mask.h5', 'w')
    dset = dataFile['spectra']
    sset = smoothFile['spectra']
    data = dset[...]
    smooth = sset[...]
    mask = np.zeros_like(smooth)
    #Now actually do important algorithmic steps.
    #Loop some number of times
    for k in range(3):
        trun = 1
        dsig1 = 0
        dsig2 = 0
        dsig3 = 0
        winDim = (32,256)
        moveFreq = winDim[1]//2
        moveTim = winDim[0]//2
        run = 1
        winPos = (moveTim*trun, moveFreq * run)
        st = time.time()
        #Now loop some other number of times, to move the window around
        while winPos[0]+(winDim[0]//2) < data.shape[0]:
            dsig1 = 0
            dsig2 = 0
            dsig3 = 0
            run = 1
            #print '1)', run, trun, winPos
            sigma_last = 500
            #Window dimensions set as a tuple (t,f)
            winDim = (32,256)
            #calculate sigma
            sigma_now = calculateSigma(winDim, winPos, data, smooth)
            while winPos[1]+(winDim[1]//2) < data.shape[1]:
                run += 1
                moveFreq = winDim[1]//2
                #[statement missing from the source listing]
                flagWindow(winDim, winPos, data, smooth, sigma_now, mask)
                #average the last three changes in sigma
                dsig1 = dsig2
                dsig2 = dsig3
                dsig3 = sigma_last - sigma_now
                deltaSigma = (dsig1 + dsig2 + dsig3) / 3
                winDim = varyWindow(winDim, deltaSigma)
                sigma_last = sigma_now
                sigma_now = calculateSigma(winDim, winPos, data, smooth)
            #end inner loop
            trun += 1
            #window position, given at the centre point of the window
            #[statement missing from the source listing]
        #end outer loop
    maskFile.create_dataset('MaskData', data=mask)
    maskFile.close()
    dataFile.close()
    smoothFile.close()
#end main

if __name__ == "__main__":
    main()
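The per-window flagging test in this listing can also be written without explicit loops. The sketch below is a vectorised illustration of the same 5-sigma comparison against the median of the smooth surface; the function name `flag_window`, the `threshold` parameter, and the example window are assumptions for the demonstration, and unlike the listing it only returns the flags and does not replace the flagged data.

```python
import numpy as np


def flag_window(data_r, smooth_r, threshold=5.0):
    """Flag samples more than `threshold` sigma above the median of the
    smooth surface, with sigma measured about that median."""
    xbar = np.median(smooth_r)
    sigma = np.sqrt(np.mean((data_r - xbar) ** 2))
    if sigma <= 0:
        # Nothing can be flagged when the window has no spread.
        return np.zeros(data_r.shape, dtype=bool)
    return (data_r - xbar) > threshold * sigma


# Hypothetical 32 x 256 window: flat surface at 1.0, one strong sample.
smooth_r = np.ones((32, 256))
data_r = smooth_r.copy()
data_r[3, 100] = 50.0

flags = flag_window(data_r, smooth_r)
```

Only the single raised sample is flagged here, since it lies far more than five standard deviations above the median of the smooth surface.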

Appendix D

Supporting Code

D.1 SaveDataAsImage

#! /usr/bin/python
import numpy as np
import h5py
import sys
import matplotlib.image as mpimg
from PIL import Image

fn = sys.argv[1]
version = sys.argv[2]
f = h5py.File(fn, 'r')
#data = f['spectra']
data = f['MaskData']
array = data[...]
#scale the 0/1 mask to 0/255 for an 8-bit image
array = array.astype('uint8')*255
im = Image.fromarray(array)
im.save(fn+version+'.png')

D.2 transpose

#! /usr/bin/python
import numpy as np
import h5py
import sys

#===============================================================#
#                                                               #
#                                                               #
#                  A script to transpose data                   #
#                                                               #
#                                                               #
#===============================================================#

if len(sys.argv) <= 1:
    print("Include a file as an argument")
    sys.exit(1)
#end if
fn = sys.argv[1]
newfn = fn + "_trans.h5"
data = h5py.File(fn,'r')
trans = h5py.File(newfn, 'w')
dset = data['spectra']
arr = dset[...]
arrt = arr.transpose()
tset = trans.create_dataset('spectra', data=arrt)
data.close()
trans.close()
#SumThreshold1T.py newfn

D.3 plotStuff

#! /usr/bin/python
import numpy as np
import h5py
import scipy
from matplotlib import pyplot as pp
import sys

dfn = sys.argv[1]
dataF = h5py.File(dfn, 'r')
dset = dataF['spectra']
#plot a single time step (row 4) against frequency channel
data = dset[4,:]
pp.plot(data)
pp.show()
dataF.close()

D.4 makeSmooth

#! /usr/bin/python
import numpy as np
import h5py
import SumThreshold1T as st
import sys

fn = sys.argv[1]
orig = h5py.File(fn, 'r')
flat = h5py.File('flat' + fn, 'w')
dset = orig['spectra']
arr = dset[...]
flatR = np.zeros_like(arr)
for i in range(arr.shape[0]):
    j = 0
    while j < flatR.shape[1]:
        flatR[i,j] = st.setChi(j, arr[i,:], 300, 1, 1)
        j += 1
    #end while j
#end for i
fset = flat.create_dataset('spectra', data=flatR)
orig.close()
flat.close()

D.5 noise

#! /usr/bin/python
import numpy as np
import random
import scipy
import matplotlib
from matplotlib import pyplot as pp
import h5py

original = h5py.File('flatter14.h5', 'r')
dset = original['spectra']
origR = dset[...]
#origR = np.zeros((3600,14200))
ave = np.zeros(origR.shape[1])
newR = np.empty_like(origR)
for i in range(origR.shape[1]):
    ave[i] = np.mean(origR[:,i])
#end for i
for j in range(3600):
    array = np.random.normal(0, 0.2, origR.shape[1])
    if j == 60 or j == 560 or j == 1060:
        newR[j,:] = np.zeros_like(array)
    else:
        for i in range(origR.shape[1]):
            #if i >= 5500 and i <= 5510:
            #if i== 2500:
            #    if i%2 == 0:
            #        newR[j,i] = array[i] + ave[i] + 10
            #    else:
            #        newR[j, i] = array[i] + ave[i]
            #    #end if
            #    newR[j,i] = array[i] + ave[i] + 0.7
            #end if
            #else:
            newR[j, i] = array[i] + ave[i]
        #end for i
#end for j
newF = h5py.File('realTime_60*3-500-02.h5','w')
newF.create_dataset('spectra', data=newR)
newF.close()
original.close()
