Mining the MACHO dataset

Markus Hegland, Mathematical Sciences Institute, ANU

Margaret Kahn, ANU Supercomputer Facility

• The MACHO project

• Woods data set

• Data exploration and data properties

• Data preprocessing

• Feature sets

• Classification using additive models

• Training process

• Web site

The MACHO Project

• To find evidence of dark matter from gravitational lensing effect

• Observations at Mt Stromlo 1992 - 1999

• 6 × 10^7 observed stars

• 100,000 CCD images

Woods Data Set

• 792 stars identified as long period variable

• Chosen from the full MACHO data set

• Original data processed by SODOPHOT to give red and blue light curves

• Missing data

• Large errors

• Unequal sampling

Stars from the Woods data set

Two typical long-period stars

Data Preprocessing

• Data sampling is not uniform, so standard Fourier transforms cannot be used directly.

• Periodic stars satisfy f(t+p) = f(t) for some period p, say.

• Long period variable stars are not exactly periodic, e.g. f(t) = f(t+p) + g(t), where g is small compared with f.

• Use “periodic smoothing” to estimate missing data.

Periodic Smoothing

An estimate for f can be determined by minimizing the function

$$J(f) = \sum_{i=1}^{n} \left( f(t_i) - y_i \right)^2 + \alpha \int f''(t)^2 \, dt + \beta \int \left( f(t+p) - f(t) \right)^2 \, dt$$

The function f is modeled as a piecewise linear function. In practice p is not known, but it can be estimated by a method such as Pisarenko’s method. For now the multiplier β of the second penalty function is much smaller than the multiplier α of the first.
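A minimal sketch of how this minimization might be carried out, assuming f is piecewise linear on a uniform grid with spacing h and that the period p is an exact multiple k of h. Second differences at the grid nodes stand in for f'', and the integrals become sums over nodes. The function names, the grid parameters and the dense normal-equations solver are illustrative assumptions, not the implementation used in the MACHO software.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for q in range(c, n + 1):
                a[r][q] -= factor * a[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][q] * x[q] for q in range(r + 1, n))) / a[r][r]
    return x

def periodic_smooth(times, values, t0, h, m, k, alpha, beta):
    """Node values of a piecewise linear f minimising a discretised J(f).

    Nodes sit at t0 + j*h for j = 0..m-1; the period is assumed to be k*h.
    Observation times are assumed to lie inside the grid.
    """
    rows, rhs = [], []
    # Data-fit rows: f(t_i) is a linear interpolation of two neighbouring nodes.
    for t, y in zip(times, values):
        s = (t - t0) / h
        j = min(max(int(math.floor(s)), 0), m - 2)
        w = s - j
        row = [0.0] * m
        row[j], row[j + 1] = 1.0 - w, w
        rows.append(row)
        rhs.append(y)
    # Roughness rows: scaled second differences approximate f''.
    ca = math.sqrt(alpha * h) / h ** 2
    for j in range(1, m - 1):
        row = [0.0] * m
        row[j - 1], row[j], row[j + 1] = ca, -2.0 * ca, ca
        rows.append(row)
        rhs.append(0.0)
    # Periodicity rows: f(t + p) - f(t) evaluated at the nodes.
    cb = math.sqrt(beta * h)
    for j in range(m - k):
        row = [0.0] * m
        row[j], row[j + k] = -cb, cb
        rows.append(row)
        rhs.append(0.0)
    # Least squares via the normal equations (fine for a small grid).
    ata = [[sum(r[a] * r[b] for r in rows) for b in range(m)] for a in range(m)]
    atb = [sum(r[a] * y for r, y in zip(rows, rhs)) for a in range(m)]
    return solve(ata, atb)
```

The returned node values cover the whole grid, so stretches with no observations are filled in by the two penalty terms rather than by the data.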

Feature Sets

• Features are calculated to characterize the light curves.

• Magnitudes are observed in both the red and blue frequency ranges.

• The difference between these is proportional to the logarithm of the ratio of the intensities of blue and red light; it is called the colour index.

• Summary features of the light curves are obtained from the colour and magnitudes by forming the average (or median) over time, the amplitude of the fluctuations, the average frequency or time scale and a measure of the time scale fluctuations.

Features cont’d.

• Correlation between red and blue magnitudes.

• 9 features calculated and stored for each light curve.

• Use these features as predictor variables for the classifier, NOT original light curve data.
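As an illustration of the kind of summary features described above, the following sketch computes a handful of them from one red and one blue light curve. The slides do not list the nine features exactly, so the choices here (median as the level, standard deviation as the amplitude, intervals between upward crossings of the median as the time scale, colour taken as blue minus red) are assumptions made for the example, as are all function and key names.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def crossing_intervals(times, values):
    """Times between successive upward crossings of the median level."""
    level = statistics.median(values)
    ups = [t1 for t1, v0, v1 in zip(times[1:], values, values[1:])
           if v0 < level <= v1]
    return [b - a for a, b in zip(ups, ups[1:])]

def light_curve_features(times, red, blue):
    """Summary features of one star (illustrative selection only)."""
    colour = [b - r for r, b in zip(red, blue)]  # assumed sign convention
    gaps = crossing_intervals(times, red)
    return {
        "median_red": statistics.median(red),
        "median_colour": statistics.median(colour),
        "amplitude_red": statistics.pstdev(red),
        "amplitude_colour": statistics.pstdev(colour),
        "time_scale": statistics.fmean(gaps) if gaps else float("nan"),
        "time_scale_spread": statistics.pstdev(gaps) if len(gaps) > 1 else 0.0,
        "red_blue_correlation": pearson(red, blue),
    }
```

Each star is then represented by one short feature vector, which is what the classifier sees in place of the raw light curve.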

Classification using additive models

ANOVA decomposition, Friedman (MARS 1991), Hastie-Tibshirani (GAM 1990), Wahba 1990

$$f(x) = f_0 + \sum_{i} f_i(x_i) + \sum_{i,j} f_{i,j}(x_i, x_j) + \dots$$

For example, such a function could approximate a classification function

$$f(x) = \log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)}$$

to decide to which of two classes (0 or 1) a particular star belongs.

Additive Models

• In general an additive model is expressed as a sum of unknown smooth functions that have to be estimated from the data.

• The model is fitted by using a local scoring algorithm which iteratively fits weighted additive models by a back-fitting algorithm.

• This is a Gauss-Seidel method which iteratively smooths partial residuals by weighted linear least squares.
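The back-fitting step can be sketched as follows for a purely additive model with a numeric response. This is a simplified illustration: it is unweighted, it uses a bin-average smoother (piecewise constant fits) in place of weighted linear least squares, and it omits the outer local scoring loop that would reweight the data for classification. The names `bin_smoother` and `backfit` are hypothetical.

```python
def bin_smoother(x, r, n_bins=8):
    """Smooth residuals r against x by averaging within equal-width bins."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((xi - lo) / width), n_bins - 1) for xi in x]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for b, ri in zip(idx, r):
        sums[b] += ri
        counts[b] += 1
    return [sums[b] / counts[b] for b in idx]

def backfit(columns, y, n_iter=20):
    """Fit y ~ f0 + sum_j f_j(x_j) by cycling over the components."""
    n = len(y)
    f0 = sum(y) / n
    fits = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, x in enumerate(columns):
            # Partial residual: remove the constant and all other components.
            partial = [
                y[i] - f0 - sum(fits[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            smooth = bin_smoother(x, partial)
            mean = sum(smooth) / n
            # Centre each component so the constant f0 stays identifiable.
            fits[j] = [s - mean for s in smooth]
    return f0, fits
```

Each inner update uses the most recent estimates of the other components, which is what makes the sweep a Gauss-Seidel iteration.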

Possible basis functions for the approximation space in 1D.

Indicator functions, hat functions, hierarchical hat functions

ADDFIT uses 1D basis functions
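For reference, the first two families can be written down in a few lines. This is a generic sketch on a uniform grid with spacing h, not code taken from ADDFIT; the function names are chosen for the example.

```python
def indicator(x, left, h):
    """Indicator of the cell [left, left + h): piecewise constant basis."""
    return 1.0 if left <= x < left + h else 0.0

def hat(x, centre, h):
    """Hat function: 1 at `centre`, falling linearly to 0 at centre ± h."""
    return max(0.0, 1.0 - abs(x - centre) / h)

def evaluate(coeffs, x, x0, h):
    """Piecewise linear function with node values `coeffs` at x0 + j*h."""
    return sum(c * hat(x, x0 + j * h, h) for j, c in enumerate(coeffs))
```

With hat functions the coefficients are simply the function values at the nodes, so `evaluate([0.0, 1.0, 4.0, 9.0], 1.5, 0.0, 1.0)` interpolates linearly between the values at nodes 1 and 2.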

Boosting

• Boosting is a machine learning procedure which can improve the accuracy of a weak learning algorithm by combining many runs of it.

• The AdaBoost procedure used in this code calls a weak learning procedure several times and maintains a distribution of weights over the training set.

• Initially all weights are equal, but then the weights of incorrectly classified examples are increased so that the weak learner concentrates more on them.
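A compact sketch of the AdaBoost weight update, assuming class labels coded as -1 and +1 and a single-feature threshold rule ("decision stump") as the weak learner. The stump is a stand-in chosen for brevity; the code described in these slides uses additive models as the learner, and all names below are hypothetical.

```python
import math

def best_stump(X, y, w):
    """Weak learner: the threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=10):
    """Return a list of (vote, feature, threshold, sign) weak rules."""
    n = len(y)
    w = [1.0 / n] * n  # start from a uniform distribution of weights
    model = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard the logarithm
        vote = 0.5 * math.log((1.0 - err) / err)
        model.append((vote, j, thr, sign))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-vote * yi * (sign if row[j] >= thr else -sign))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, row):
    """Weighted vote of all weak rules; the sign gives the predicted class."""
    return sum(vote * (sign if row[j] >= thr else -sign)
               for vote, j, thr, sign in model)
```

On a one-dimensional example where the positive class occupies an interval, no single stump classifies every point correctly, but a weighted vote of three stumps does.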

Training Program

• Start with an initial training set of “accepted” stars, that is, stars of the type of interest.

• Helpful to also have a set of “unacceptable” stars to help the trainer.

• Additive models are used to form a classification function using the feature set data from the initial training set.

• This function is then applied to the full data set and the stars ordered based on the function values.

• The light curves are displayed in decreasing order of function value. Ideally the training set stars should appear first.

• Further “acceptable” and “unacceptable” stars can be chosen by clicking on the relevant button and then a new classification carried out.

• Continue the process until satisfied with the star sorting.
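The bookkeeping behind this loop is small and can be sketched as below. The classification step itself is left out; `scores` is assumed to be a mapping from star id to classification function value, and the page size of 60 follows the description of the web demo later in these slides. The function names and the 1/0 label coding are assumptions for the example.

```python
def rank_stars(scores):
    """Star ids ordered by decreasing classification function value."""
    return sorted(scores, key=scores.get, reverse=True)

def page(ranked, number, size=60):
    """The `number`-th page (0-based) of `size` stars from the ranked list."""
    start = number * size
    return ranked[start:start + size]

def update_labels(labels, accepted=(), rejected=()):
    """Record new user decisions; a later decision overrides an earlier one."""
    updated = dict(labels)
    updated.update({star: 1 for star in accepted})
    updated.update({star: 0 for star in rejected})
    return updated
```

After each round of choices the updated labels become the training set for the next classification, and the ranking is recomputed from the new scores.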

Web based data mining tool

http://datamining.anu.edu.au

Software link to Macho demo.

This software contains Python code to read ASCII star data files, process them by removing any stars with insufficient good data, and then calculate several features for each star. These features are then used by the training program to select groups of like stars.

The programs have incorporated a method of caching data so that it is kept in binary form for quicker access. The caching software was written by Ole Nielsen and can be downloaded from the ANU datamining web page.

http://datamining.anu.edu.au/

Procedure to run:

• Determine initial training set.

• When prompted enter the star numbers for acceptable stars. Stars 1 and 2 are already entered as a default.

• When the web browser appears with the top ranked 60 stars, those that have already been deemed acceptable will have the “accept” button disabled and those that have been rejected will have the “reject” button disabled.

• The user can then choose more acceptable or unacceptable stars by clicking on the relevant button. Previous decisions can be changed.

• After choosing a few stars click on the “continue” button to see the next 60 top ranked stars or go down to further pages to make more choices.

• Continue until satisfied with the initial ranked stars. Stop by clicking “quit” or “restart”.
