
Science in the era of Gaia data

Andy Casey Astrophysics; Statistics

“big”

andycasey astrowizicist astrowizici.st

http://astrowizici.st

- The Gaia mission

All about Gaia.

What makes data big?



- Pedagogy of data analysis, when you have lots of data

Examples of how pedagogy drives decisions in big and small data analysis (data-driven methods, non-parametric models)






- Tools & resources for data analysis: pick the right tool for the job



- Unsolicited advice to be ahead of the data wave



Having data is no longer currency in astronomy.

Having good ideas and the ability to effortlessly use data is currency.

This talk is about making you rich.

The Gaia satellite

The Billion Star Surveyor(tm): One billion stars for one billion Euros

An astrometric mission designed to measure the position, parallax, brightness, and proper motions for more than one billion stars.


For up to 1.7 billion sources:

- Positions
- Proper motions
- Radial velocities (and scatter)
- Parallax
- Photometry (G, BP, RP)
- Colours (G-BP, G-RP, BP-RP)
- Dust along line of sight
- Stellar effective temperatures
- Stellar radii
- Stellar masses
- Stellar luminosities
- Astrometric excess noise (more than a single-star solution)
- Orbital solutions for solar system objects
- Variable stars (including light curves of new kinds of objects)

Credit: Erik Tollerud

Source count completeness

Gaia observes everything. Stars, galaxies, quasars, asteroids, et cetera.

Photometric performance

Kepler-precision photometry, but for one billion stars

Astrometric performance

(It is very good)

Astrometric performance

Note: “Hipparchus data release 1”

Proper motion performance

G ~ 18 star at 30 kpc w/ 0.4 mas/yr is approx. 2 km/s precision at 100,000 light years away

You are here.

Credit: S. BRUNIER/ESO/ESA

Gaia Data Release 2

This was the first “real” data release, and just averaged values.

Are we at “big data” yet?

Gaia Data Release 5

The flood is coming. This is what we need to deal with (easily).

- Position measurements: 128 trillion
- Brightness measurements: 380 trillion
- Medium-resolution spectra: 1 billion
- Low-resolution spectra: 100 billion
- Size of reduced data products for science: 1 petabyte

Rule of thumb: if you can load it into RAM, then you are not at big data.
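As a rough worked example of this rule of thumb, the sketch below estimates how much memory a dense catalogue would need. The number of columns and the 64-bit float type are illustrative assumptions, not Gaia's actual schema; only the 1.7 billion source count comes from the slides.

```python
import numpy as np

def catalogue_size_gb(n_rows, n_columns, dtype=np.float64):
    """Size in gigabytes of a dense (n_rows, n_columns) array."""
    return n_rows * n_columns * np.dtype(dtype).itemsize / 1e9

# Assumption: five astrometric parameters per source, stored as 64-bit floats.
n_sources = 1_700_000_000
size_gb = catalogue_size_gb(n_sources, 5)
print(f"{size_gb:.0f} GB")  # 68 GB: compare this with the RAM you actually have
```

If the number printed is comfortably below your available RAM, the rule of thumb says you have small data.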


Five pedagogical questions to ask yourself to keep you out of scientific and data analysis cul-de-sacs

1. Do you have small data, or do you have big data?

2. What is the simplest, dumbest model you can think of?

3. What assumptions are you making?

4. What is the utility of the model?

5. What can you afford?


If you can’t load it into RAM, you have options (in order of difficulty):

• Do you need to load all the data at once?

• Memory-mapped arrays: store data on external hard drives and treat it (really carefully) as memory.

• Can you subsample the data and get a comparable result?

• Can you use statistics of the data to get a comparable result?

• Can you simplify the data you use and get a comparable result (e.g., ignore covariances)?

• Can you recast your problem as a map-reduce problem?
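A minimal sketch of two of these options together: a memory-mapped array, read one chunk at a time, accumulating summary statistics instead of holding everything in RAM. The file name, the synthetic "parallax" values, and the chunk size are all made up for illustration; a real file would already exist on disk.

```python
import os
import tempfile
import numpy as np

# Write a stand-in data file to disk (hypothetical; real data would already be there).
path = os.path.join(tempfile.mkdtemp(), "parallax.dat")
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=0.5, size=1_000_000)
data.astype(np.float64).tofile(path)

# Memory-map the file: nothing is read from disk until a slice is accessed.
mm = np.memmap(path, dtype=np.float64, mode="r")

# Accumulate sufficient statistics one chunk at a time.
chunk_size = 100_000
n, total, total_sq = 0, 0.0, 0.0
for start in range(0, mm.shape[0], chunk_size):
    chunk = np.asarray(mm[start:start + chunk_size])
    n += chunk.size
    total += chunk.sum()
    total_sq += np.square(chunk).sum()

mean = total / n
variance = total_sq / n - mean**2  # population variance
print(mean, variance)
```

Only one chunk is resident in memory at any time, so the same loop works when the file is far larger than RAM.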


Always start with the simplest model you can think of, even if you “know” it is dumb and will not give you great results. For example:

1. Linear regression (for fitting data): a design matrix can have non-linear entries, but you are still doing linear regression!

2. k-means (for clustering): use k-means++ for initialisation, “always”

3. Logistic regression (for classification)

Don’t change this model until you have answered all five questions!

When complicated models aren’t working correctly, always ask what the simplest, dumbest thing is that you could test to check your intuition.
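To illustrate the point about design matrices, here is a small sketch in which one column of the design matrix is sin(x), yet the fit is still ordinary linear least squares because the model is linear in its parameters. The data, the true parameter values, and the noise level are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
true_theta = np.array([1.0, -0.5, 2.0])  # hypothetical "truth" for the toy data
y = (true_theta[0] + true_theta[1] * x + true_theta[2] * np.sin(x)
     + rng.normal(0, 0.1, x.size))

# Design matrix with a non-linear entry, sin(x). The model y = A @ theta
# is still linear in theta, so this is still linear regression.
A = np.vstack([np.ones_like(x), x, np.sin(x)]).T

theta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ theta
print(theta)
```

Plotting `residuals` against `x` is a cheap first check on where a simple model like this is failing.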


You have made an infinite number of assumptions.

What are the most important assumptions?

(Seriously, write them down)

• Do you assume that your data are drawn from a straight line?

• Do you assume the data points are independent?

• Do you assume the noise in the data is normally distributed?

• Do you assume that you have the correct objective function?

• Do you assume that you have optimised to the global minimum?

• Do you assume that you have used an appropriate optimisation algorithm?

• Do you assume that the noise estimates you have are correct?

• Do you assume that we do not live in a simulation? (Would it matter?)


All models are wrong, some are useful.

Even a dumb model can tell you a lot about what you should do next. If you have a dumb model but you parameterise your model errors, then the model errors (or residuals from the data) will inform you where your model is failing.

• Do the underlying physical models make good predictions?

• Under what conditions will this model fail? (models should fail loudly!)

• Do you need a point estimate of your model parameters, or do you need a posterior probability distribution over data?

• Does this model give a point estimate that you can use for other purposes?


Sometimes a point estimate of the parameters of a very simple model is good enough to answer the question you have.

Sometimes you will need to sample a posterior probability distribution of a complicated model. Or worse: calculate the fully marginalised likelihood (FML; a.k.a. the “evidence”).

What can you afford? (etc.)

Answers to these questions will (in a very practical sense) help drive your model complexity.

Example: data-driven models

For when the data are better than the models.

Hierarchical data-driven models of stellar properties

Hierarchical, complex model

Analytic integrals to marginalise parameters

Tractable!-ish

Use joint information between stars to de-noise properties of the sample

arXivs: 1703.08112, 1706.05055 (Leistedt et al. and Anderson et al.)

1. Do you have small data, or do you have big data? Small.

2. What is the simplest, dumbest model you can think of? Gaussian mixture model.

3. What assumptions are you making? Independence among stars. Many others.

4. What is the utility of the model? Most parallaxes are noisy. This model improves them.

5. What can you afford? Posterior distributions over data, but only through analytic marginalisation.
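The idea of using joint information between stars to de-noise individual measurements can be sketched with a deliberately tiny toy problem. This is not the model of Leistedt et al. or Anderson et al.: it assumes a single Gaussian population (rather than a mixture), one measured quantity per star, and known Gaussian noise, so that the per-star posterior is analytic. All numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stars = 5000
mu_pop, tau_pop = 2.0, 0.3                # assumed population mean and spread
true = rng.normal(mu_pop, tau_pop, n_stars)
sigma = rng.uniform(0.2, 1.0, n_stars)    # per-star measurement uncertainty
observed = rng.normal(true, sigma)

# Estimate the population from the noisy sample itself (moment matching).
mu_hat = np.average(observed, weights=1.0 / sigma**2)
tau2_hat = max(np.var(observed) - np.mean(sigma**2), 1e-12)

# Analytic per-star posterior: the product of two Gaussians is a Gaussian.
post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau2_hat)
post_mean = post_var * (observed / sigma**2 + mu_hat / tau2_hat)

rms_before = np.sqrt(np.mean((observed - true) ** 2))
rms_after = np.sqrt(np.mean((post_mean - true) ** 2))
print(rms_before, rms_after)
```

Because every integral here is analytic, the cost per star is a handful of arithmetic operations, which is the sense in which such a model stays tractable.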


Example: non-parametric models

Terribly named, because they really have infinite numbers of parameters.

1. Do you have small data, or do you have big data? Big. We ded.

2. What is the simplest, dumbest model you can think of? Mixture of two components.

3. What assumptions are you making? Some stars with similar colours and luminosity will be single stars.

4. What is the utility of the model? Point estimates of binary probability for two billion stars.

5. What can you afford? Posterior distributions over data, but only if we get clever.

Non-parametric model for binary star inference



Observables: radial velocity variance; template systematics; astrometric noise; bluer/redder than expected; photometric variability

Fit a mixture model (normal and log-normal) to all observables of stars in our “ball”

Calculate p(single|data) for the star of interest

Move on to the next…
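The p(single|data) step can be sketched for a single observable as follows. This is a toy version under stated assumptions: one observable only (a radial velocity scatter), single stars drawn from a normal component and binaries from a log-normal component, and mixture parameters that are made-up values standing in for a fit to the stars in the “ball”.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def lognormal_pdf(x, mu, sigma):
    """Log-normal density; zero for x <= 0."""
    x = np.asarray(x, dtype=float)
    pdf = np.zeros_like(x)
    pos = x > 0
    pdf[pos] = np.exp(-0.5 * ((np.log(x[pos]) - mu) / sigma) ** 2) / (
        x[pos] * sigma * np.sqrt(2 * np.pi))
    return pdf

def p_single(x, weight, mu_s, sigma_s, mu_b, sigma_b):
    """Posterior probability that each datum belongs to the single-star component."""
    single = weight * normal_pdf(x, mu_s, sigma_s)
    binary = (1.0 - weight) * lognormal_pdf(x, mu_b, sigma_b)
    return single / (single + binary)

# Hypothetical fitted mixture parameters and observed scatter values (km/s).
rv_scatter = np.array([0.5, 1.0, 5.0, 20.0])
prob = p_single(rv_scatter, weight=0.7, mu_s=1.0, sigma_s=0.5, mu_b=2.0, sigma_b=1.0)
print(prob)
```

With several observables, one simple (and strong) assumption is that they are independent given the component, in which case the per-observable likelihoods multiply.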

[Figure: three panels showing radial velocity variance (km s^-1) against apparent G, BP, and RP flux (10^5 to 10^8).]


Non-parametric model for binary star inference

In practice we might want to sample the mixture parameters for every star.

Can we afford it?

Hell no!

We can barely optimise it!

But we may be able to analytically marginalise out parameters that we don’t care about.


~210 million parameter model for brighter stars, about 1B parameter model for all stars.

Converted a “big data” problem to a “small data” problem that is embarrassingly parallel, and one where we might be able to analytically marginalise out many hyper-nuisance-parameters.

[Figure: vrad excess/P√(1 − e²) against K/P√(1 − e²), both on logarithmic axes from 10^-4 to 10^3, coloured by binary probability (0 to 1).]


[Figure: two panels of absolute G magnitude against bp-rp (N = 6368651), with a colour bar for binary fraction (0 to 1).]


Now we can do a population study of binary stars that is 10^5 times larger than anything we could do before.

Why not just turn on the Machine Learning(tm)?

As physicists we are often interested in the mechanisms that produced the data. That is, we want a generative model for the data.

Neural networks are universal function approximators (we’ve known that literally for decades), but they will not give you a generative model for the data that is interpretable. This applies to most ML methods.

Sometimes that’s OK. Sometimes you don’t care about interpretability, or how the data were generated. But often we do care, and we can afford an interpretable model, but we (incorrectly) opt to use Machine Learning.

Consider a problem where:

• There are lots of high-quality data.

• It’s hard to model those data, and/or the existing models do not make good predictions (“the data are better than the models”).

• We just want answers. We don’t care why.

Turn on the ML!

• Create some training set of well-known objects.

• Train a Convolutional Neural Network (CNN) to estimate the intrinsic (or latent) properties of some objects, given an image (or spectrum) of the object.

• Responsibly run cross-validation (or drop-out) to convince yourself things work.

• Run the test step.

• Your CNN has identified an object with properties that defy everything we thought we knew about astrophysics! (But in many other ways, it is “similar enough” to objects in the training set, so we have some reason to trust it.)

(Get it? Convolutional Neural Network.) Models that lack interpretability can really suck.

When should I turn on the Machine Learning(tm)?

Can you write a generative model for the data (that evaluates in less than a Hubble time)? Don’t use machine learning. Forward model the data.

Do you care about model interpretability, or interpreting the results that you get? Don’t use machine learning. Forward model the data.

Do you want a posterior probability distribution over data? Don’t use machine learning. Forward model the data.

Do you need to retain some semblance of probability over data? Don’t use machine learning. Forward model the data.

Do you want to classify or estimate things, or make decisions, and you don’t care about the physics? Hell yeah! Turn the Machine Learning up to 11!

Even when you turn on the Machine Learning(tm), the rules still apply!


From Google on ‘Scalable and accurate deep learning with electronic health records’ (npj Digital Medicine):

Regularised logistic regression performed essentially just as well as Deep Neural Networks (mortality C.I. 0.81-0.89 vs 0.94-0.96).

Huge cost, complexity, and interpretability difference in those models.

What is the simplest, dumbest model you can think of?

Start with that.

https://www.nature.com/articles/s41746-018-0029-1.pdf
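For a sense of how little machinery the “simplest, dumbest” classifier needs, here is a sketch of L2-regularised logistic regression written with plain NumPy gradient descent. It is an illustration only: the data are two synthetic Gaussian blobs, the step size and regularisation strength are arbitrary choices, and in practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual route.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, l2=1.0, learning_rate=0.1, n_steps=2000):
    """Fit L2-regularised logistic regression by gradient descent."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(n_steps):
        grad = X1.T @ (sigmoid(X1 @ w) - y) / y.size
        grad[1:] += l2 * w[1:] / y.size  # the intercept is not penalised
        w -= learning_rate * grad
    return w

def predict_proba(X, w):
    """Probability of class 1 for each row of X."""
    return sigmoid(w[0] + X @ w[1:])

# Synthetic toy data: two overlapping Gaussian blobs.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y = np.concatenate([np.zeros(500), np.ones(500)])

w = fit_logistic(X, y)
accuracy = np.mean((predict_proba(X, w) > 0.5) == y)
print(w, accuracy)
```

The fitted weights are directly interpretable as log-odds per unit of each feature, which is part of the cost and interpretability difference the slide refers to.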


Standard tools for data analysis

Linear algebra. Go back to basics. Keep your linear (matrix) algebra sharp.

Python (3): astropy, numpy, scipy, scikit-learn, TensorFlow (not just for ML). Positives: Good glue. Human-readable, machine-executable. Transferable skill. Negatives: Only a little bit slow.

Stan: probabilistic programming language. When to use: If you have a model that doesn’t have bespoke parts (e.g., no models at grid points, or functions that are not differentiable). When not to use: When your model contains bespoke parts. Or if statements (kinda).

Fortran/C: Betterise your code by speeding up the slowest parts. You can call Fortran or C functions directly from Python.

PostgreSQL: Learn it. Write scripts to ingest data. You will thank yourself later.

Hadoop: If you have a map-reduce job, use Hadoop. Transferable skill.

Resources

Statistics:
- Information theory, inference and learning algorithms: http://www.inference.org.uk/itila/p0.html
- Sokal’s notes: http://www.stat.unc.edu/faculty/cji/Sokal.pdf
- Probabilistic Programming and Bayesian Methods for Hackers: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- Bayesian Data Analysis: http://www.stat.columbia.edu/~gelman/book/
- Hamiltonian Monte Carlo: http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html

Version control:
- oh shit git: http://ohshitgit.com/

Machine Learning:
- Talking Machines: https://www.thetalkingmachines.com/home?context_entity_type=node&context_entity_id=14033
- Which ML algorithm is for me?: https://blog.statsbot.co/machine-learning-algorithms-183cc73197c
- Matrix calculus you need for deep learning: http://explained.ai/matrix-calculus/index.html
- You should understand backpropagation: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
- Machine Learning 101 (Google Engineers): https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/preview?imm_mid=0f9b7e&cmp=em-data-na-na-newsltr_20171213&slide=id.g168a3288f7_0_58

Code:
- astropy: http://astropy.org
- tensorflow: http://tensorflow.com
- stan: http://mc-stan.org/
- scikit-learn: http://scikit-learn.org/
- fortran from python: http://arogozhnikov.github.io/2015/11/29/using-fortran-from-python.html

Probabilistic graphical models:
- an introduction: https://blog.statsbot.co/probabilistic-graphical-models-tutorial-and-solutions-e4f1d72af189

Linear algebra:
- immersive linear algebra: http://immersivemath.com/ila/learnmore.html

Unsolicited advice to be ahead of the data wave

1. Create a GitHub or BitBucket account and use it. Push daily. Push good code. Push bad code. Push grant proposals. Push paper drafts. Push. Push. Push.

2. Read arXiv:1008.4686 and do all the exercises.

3. Be familiar with tools (machine learning, optimisation algorithms, linear algebra) and know how to choose the right tool. It’s hard.

4. Think about whether you can map-reduce your data analysis problem. If you can, learn Hadoop as part of that project.

5. Start with the simplest model for data analysis. But for fun, think about how to fit a line to one petabyte of data.
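One possible way to think about fitting a line to data that never fit in memory is to notice that least squares only needs the sums A^T A and A^T y, and sums can be computed per chunk and then added. The sketch below mimics that map-reduce pattern in plain Python. It does not use Hadoop, the chunks stand in for files or machines, and the data, noise model (equal uncertainties), and chunk count are synthetic assumptions.

```python
from functools import reduce
import numpy as np

def map_chunk(x, y):
    """Map step: sufficient statistics (A^T A, A^T y) for one chunk."""
    A = np.vstack([np.ones_like(x), x]).T
    return A.T @ A, A.T @ y

def reduce_stats(a, b):
    """Reduce step: sufficient statistics from two chunks simply add."""
    return a[0] + b[0], a[1] + b[1]

# Synthetic data; in a real problem each chunk would be read from disk.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100_000)
y = 3.0 + 0.5 * x + rng.normal(0, 1, x.size)

chunks = zip(np.array_split(x, 20), np.array_split(y, 20))
ATA, ATy = reduce(reduce_stats, (map_chunk(cx, cy) for cx, cy in chunks))

# Solve the 2x2 normal equations for the line parameters.
intercept, slope = np.linalg.solve(ATA, ATy)
print(intercept, slope)
```

Each map step returns only a 2x2 matrix and a 2-vector, so the amount of data moved between workers does not grow with the size of the catalogue.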


Gaia Sprints

Not traditional scientific meetings.

Aim is to bring together people who want to exploit Gaia data on short timescales.

We do everything in the open. Open data. Open science.

No invited participants; everyone applies to attend (incl. the SOC, the Gaia principal investigator, etc).

“Best scientific experience of my life”, “Most important week of my year”.

Next Sprint: 2019 Santa Barbara


http://gaia.lol

Conclusions

The data are only going to get bigger. Those who can’t swim will drown.

Those who can swim will drown in .

Remember to ask yourself:

Date post:	03-Aug-2020