Science in the era of Gaia data
Andy Casey Astrophysics; Statistics
“big”
andycasey astrowizicist astrowizici.st
- The Gaia mission
All about Gaia.
What makes data big?
- Pedagogy of data analysis, when you have lots of data
Examples of how pedagogy drives decisions in big and small data analysis (data-driven methods, non-parametric models)
- Tools & resources for data analysis: pick the right tool for the job
- Unsolicited advice to be ahead of the data wave
Having data is no longer currency in astronomy.
Having good ideas and the ability to effortlessly use data is currency.
This talk is about making you rich.
The Gaia satellite
The Billion Star Surveyor(tm): one billion stars for one billion Euros
An astrometric mission designed to measure the position, parallax, brightness, and proper motions for more than one billion stars.
For up to 1.7 billion sources:
- Positions
- Proper motions
- Radial velocities (and scatter)
- Parallax
- Photometry (G, BP, RP)
- Colours (G-BP, G-RP, BP-RP)
- Dust along line of sight
- Stellar effective temperatures
- Stellar radii
- Stellar masses
- Stellar luminosities
- Astrometric excess noise (more than a single-star solution)
- Orbital solutions for solar system objects
- Variable stars (including light curves of new kinds of objects)
Credit: Erik Tollerud
Source count completeness
Gaia observes everything. Stars, galaxies, quasars, asteroids, et cetera.
Photometric performance
Kepler-precision photometry, but for one billion stars
Astrometric performance
(It is very good)
Note: “Hipparchus data release 1”
Proper motion performance
G ~ 18 star at 30 kpc w/ 0.4 mas/yr is approx. 2 km/s precision at 100,000 light years away
You are here.
Credit: S. BRUNIER/ESO/ESA
Gaia Data Release 2
This was the first “real” data release, and just averaged values.
Are we at “big data” yet?
Gaia Data Release 5
The flood is coming. This is what we need to deal with (easily).
Position measurements: 128 trillion
Brightness measurements: 380 trillion
Medium-resolution spectra: 1 billion
Low-resolution spectra: 100 billion
Size of reduced data products for science: 1 petabyte
Are we at “big data” yet?
Rule of thumb: if you can load it into RAM, then you are not at big data.
Five pedagogical questions to ask yourself
to keep you out of scientific and data analysis cul-de-sacs
1. Do you have small data, or do you have big data?
2. What is the simplest, dumbest model you can think of?
3. What assumptions are you making?
4. What is the utility of the model?
5. What can you afford?
1. Do you have small data, or do you have big data?
If you can’t load it into RAM, you have options (roughly in order of difficulty):
• Do you need to load all the data at once?
• Memory-mapped arrays: store data on external hard drives and treat it
(really carefully) as memory.
• Can you subsample the data and get a comparable result?
• Can you use statistics of the data to get a comparable result?
• Can you simplify the data you use and get a comparable result (e.g.,
ignore covariances)?
• Can you recast your problem as a map-reduce problem?
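A minimal sketch of two of the options above (memory-mapped arrays, and using statistics of the data), assuming a flat binary file of float64 values. The file name, sizes, and the helper `chunked_mean_and_variance` are hypothetical and only for illustration; the file here is small enough to fit in RAM, so it stands in for one that is not.

```python
import os
import tempfile

import numpy as np

# Hypothetical file standing in for a column that is too large for RAM.
path = os.path.join(tempfile.mkdtemp(), "column.dat")
n_total = 1_000_000
chunk_size = 100_000

# Write the file chunk by chunk so the full array never has to exist in memory.
rng = np.random.default_rng(42)
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(n_total,))
for start in range(0, n_total, chunk_size):
    stop = min(start + chunk_size, n_total)
    out[start:stop] = rng.normal(loc=5.0, scale=2.0, size=stop - start)
out.flush()
del out


def chunked_mean_and_variance(path, n_total, chunk_size):
    """Stream over a memory-mapped array and accumulate sufficient statistics."""
    data = np.memmap(path, dtype=np.float64, mode="r", shape=(n_total,))
    count, total, total_sq = 0, 0.0, 0.0
    for start in range(0, n_total, chunk_size):
        # Only this slice is pulled from disk into memory.
        chunk = np.asarray(data[start:start + chunk_size])
        count += chunk.size
        total += chunk.sum()
        total_sq += np.square(chunk).sum()
    mean = total / count
    # Population variance; this simple formula can lose precision when the
    # mean is very large compared with the spread.
    variance = total_sq / count - mean**2
    return mean, variance


mean, variance = chunked_mean_and_variance(path, n_total, chunk_size)
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

Only three running sums are kept between chunks, so the memory cost does not grow with the size of the file.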
2. What is the simplest, dumbest model you can think of?
Always start with the simplest model you can think of, even if you
“know” it is dumb and will not give you great results. For example:
1. Linear regression (for fitting data): a design matrix can have non-linear entries, but you are still doing linear regression!
2. k-means (for clustering) — use k-means++ for initialisation, “always”
3. Logistic regression (for classification)
Don’t change this model until you have answered all five questions!
When complicated models aren’t working correctly, always ask what the
simplest, dumbest thing is that you could test to check your intuition.
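To illustrate the point about design matrices, here is a sketch of a fit whose columns are non-linear in x while the model stays linear in its parameters. The synthetic data, the choice of columns, and the names `design_matrix` and `true_theta` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1.5 - 0.7 x + 0.3 x^2 + 2 sin(x), plus Gaussian noise.
x = np.linspace(0, 10, 200)
true_theta = np.array([1.5, -0.7, 0.3, 2.0])
sigma_y = 0.1


def design_matrix(x):
    """Columns may be non-linear in x; the model is still linear in theta."""
    return np.vstack([np.ones_like(x), x, x**2, np.sin(x)]).T


A = design_matrix(x)
y = A @ true_theta + rng.normal(scale=sigma_y, size=x.size)

# Weighted least squares: solve (A^T C^-1 A) theta = A^T C^-1 y,
# where C is diagonal with entries sigma_y^2.
inv_var = np.full(x.size, 1.0 / sigma_y**2)
ATCinvA = A.T @ (A * inv_var[:, None])
ATCinvy = A.T @ (y * inv_var)
theta = np.linalg.solve(ATCinvA, ATCinvy)

# Under the assumed Gaussian noise, this is the parameter covariance matrix.
theta_cov = np.linalg.inv(ATCinvA)
print(theta)
print(np.sqrt(np.diag(theta_cov)))
```

The solve step is a single linear-algebra operation with no iterative optimiser, which is part of why this makes a good first model.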
3. What assumptions are you making?
You have made an infinite number of assumptions.
What are the most important assumptions?
(Seriously, write them down)
• Do you assume that your data are drawn from a straight line?
• Do you assume the data points are independent?
• Do you assume the noise in the data is normally distributed?
• Do you assume that you have the correct objective function?
• Do you assume that you have optimised to the global minimum?
• Do you assume that you have used an appropriate optimisation algorithm?
• Do you assume that the noise estimates you have are correct?
• Do you assume that we do not live in a simulation? (Would it matter?)
4. What is the utility of the model?
All models are wrong, some are useful.
Even a dumb model can tell you a lot about what you should do next. If you
have a dumb model but you parameterise your model errors, then the model
errors (or residuals from the data) will inform you where your model is failing.
• Do the underlying physical models make good predictions?
• Under what conditions will this model fail? (models should fail loudly!)
• Do you need a point estimate of your model parameters, or do you need a
posterior probability distribution over data?
• Does this model give a point estimate that you can use for other purposes?
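As a sketch of how residuals from a dumb model can show where it is failing, the example below fits a straight line to synthetic data that contain a small quadratic term. The data, the noise level, and the split into "centre" and "edges" are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a small quadratic term that the dumb model ignores.
x = np.linspace(-1, 1, 100)
sigma_y = 0.05
y = 0.5 + 2.0 * x + 0.8 * x**2 + rng.normal(scale=sigma_y, size=x.size)

# The dumb model: a straight line, fit by least squares.
A = np.vstack([np.ones_like(x), x]).T
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residuals in units of the reported noise.
z = (y - A @ theta) / sigma_y
reduced_chi2 = np.sum(z**2) / (x.size - theta.size)

# Compare the mean normalised residual near the centre and near the edges.
centre = np.abs(x) < 0.3
edges = np.abs(x) > 0.8
print(f"reduced chi^2 = {reduced_chi2:.1f}")
print(f"mean z (centre) = {z[centre].mean():.1f}")
print(f"mean z (edges) = {z[edges].mean():.1f}")
```

A reduced chi-squared well above one says the line is not enough, and the sign pattern of the residuals says what is missing: curvature.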
5. What can you afford?
Sometimes a point estimate of the parameters of a very simple model is
good enough to answer the question you have.
Sometimes you will need to sample a posterior probability distribution of a
complicated model. Or worse: calculate the fully marginalised likelihood (FML;
a.k.a. the “evidence”).
What can you afford? (etc.)
Answers to these questions will (in a very practical sense) help drive your
model complexity.
Example: data-driven models
For when the data are better than the models.
Hierarchical data-driven models of stellar properties
arXivs: 1703.08112, 1706.05055 (Leistedt et al. and Anderson et al.)
Hierarchical, complex model
Analytic integrals to marginalise parameters
Tractable!-ish
Use joint information between stars to de-noise properties of the sample
1. Do you have small data, or do you have big data? Small.
2. What is the simplest, dumbest model you can think of? Gaussian mixture model.
3. What assumptions are you making? Independence among stars. Many others.
4. What is the utility of the model? Most parallaxes are noisy. This model improves them.
5. What can you afford? Posterior distributions over data, but only through analytic marginalisation.
Example: non-parametric models
Terribly named, because they really have infinite numbers of parameters.
Non-parametric model for binary star inference
1. Do you have small data, or do you have big data? Big. We ded.
2. What is the simplest, dumbest model you can think of? Mixture of two components.
3. What assumptions are you making? Some stars with similar colours and luminosity will be single stars.
4. What is the utility of the model? Point estimates of binary probability for two billion stars.
5. What can you afford? Posterior distributions over data, but only if we get clever.
Observables: radial velocity variance, template systematics, astrometric noise, bluer/redder than expected, photometric variability
Fit a mixture model (normal and log-normal) to all observables of stars in our “ball”
Calculate p(single|data) for the star of interest
Move on to the next…
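The per-star steps above can be sketched for a single observable as follows. This is an illustration under stated assumptions, not the method used in the talk: it handles one observable rather than all of them, the "ball" of neighbours is synthetic, and expectation-maximisation is a fitting choice made here for brevity. The names `fit_mixture` and `p_single` are hypothetical.

```python
import numpy as np


def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def lognormal_pdf(y, mu_log, sigma_log):
    # Density of y > 0 when log(y) is normal with mean mu_log and width sigma_log.
    return normal_pdf(np.log(y), mu_log, sigma_log) / y


def fit_mixture(y, n_iter=500):
    """EM for w * Normal(y) + (1 - w) * LogNormal(y), with all y > 0."""
    log_y = np.log(y)
    # Crude initialisation: lower half is "single-like", upper half is the tail.
    median = np.median(y)
    w = 0.5
    mu, sigma = y[y <= median].mean(), y[y <= median].std()
    mu_log, sigma_log = log_y[y > median].mean(), log_y[y > median].std()
    for _ in range(n_iter):
        # E-step: probability that each star belongs to the normal component.
        a = w * normal_pdf(y, mu, sigma)
        b = (1 - w) * lognormal_pdf(y, mu_log, sigma_log)
        r = a / (a + b)
        # M-step: weighted maximum-likelihood updates for both components.
        w = r.mean()
        mu = np.sum(r * y) / np.sum(r)
        sigma = np.sqrt(np.sum(r * (y - mu) ** 2) / np.sum(r))
        mu_log = np.sum((1 - r) * log_y) / np.sum(1 - r)
        sigma_log = np.sqrt(np.sum((1 - r) * (log_y - mu_log) ** 2) / np.sum(1 - r))
    return w, mu, sigma, mu_log, sigma_log


def p_single(y_star, w, mu, sigma, mu_log, sigma_log):
    """p(single | data) for one star, given the fitted mixture parameters."""
    a = w * normal_pdf(y_star, mu, sigma)
    b = (1 - w) * lognormal_pdf(y_star, mu_log, sigma_log)
    return a / (a + b)


# Synthetic "ball": 70% single-like stars (normal), 30% with excess scatter.
rng = np.random.default_rng(11)
singles = rng.normal(1.0, 0.2, size=1400)
binaries = rng.lognormal(mean=1.5, sigma=0.6, size=600)
y = np.concatenate([singles, binaries])
y = y[y > 0]  # the log-normal component requires positive values

params = fit_mixture(y)
p_quiet = p_single(1.0, *params)  # a star that sits in the single-star bulk
p_noisy = p_single(6.0, *params)  # a star far out in the log-normal tail
print(params[0], p_quiet, p_noisy)
```

Each star gets its own small fit that is independent of every other star's fit, so the loop over stars can be distributed without any communication between workers.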
[Figure: three panels showing radial velocity variance (km s^-1) against apparent G, BP, and RP flux.]
Non-parametric model for binary star inference
In practice we might want to sample the mixture parameters for every star.
Can we afford it? Hell no! We can barely optimise it!
But we may be able to analytically marginalise out parameters that we don’t care about.
~210 million parameter model for brighter stars, about 1B parameter model for all stars.
Converted a “big data” problem to a “small data” problem that is embarrassingly parallel, and one where we might be able to analytically marginalise out many hyper-nuisance-parameters.
[Figure: log-log plot of excess radial velocity scatter against K, each scaled by a function of P and 1 - e^2, coloured by binary probability (0 to 1).]
[Figure: two colour-magnitude diagrams (bp-rp against absolute G magnitude), N = 6368651, coloured by binary fraction (0 to 1).]
Now we can do a population study of binary stars that is 10^5 times larger than anything we could do before.
Why not just turn on the Machine Learning(tm)?
As physicists we are often interested in the mechanisms that produced the data. That is, we want a generative model for the data.
Neural networks are universal function approximators (we’ve known that literally for decades), but they will not give you a generative model for the
data that is interpretable. This applies to most ML methods.
Sometimes that’s OK. Sometimes you don’t care about interpretability, or how the data were generated. But often we do care, and we can afford an interpretable model, but we (incorrectly) opt to use Machine Learning.
Why not just turn on the Machine Learning(tm)?
Consider a problem where there are:
• Lots of high quality data.
• It’s hard to model those data, and/or the existing models do not make
good predictions (“the data are better than the models”).
• We just want answers. We don’t care why.
Why not just turn on the Machine Learning(tm)?
Turn on the ML!
• Create some training set of well-known objects.
• Train a Convolutional Neural Network (CNN) to estimate the intrinsic (or latent)
properties of some objects, given an image (or spectrum) of the object.
• You responsibly run cross-validation (or drop-out) to convince yourself things
work.
• You run the test step.
• Your CNN has identified an object with properties that defy everything we
thought we knew about astrophysics! (But in many other ways, it is “similar
enough” to objects in the training set, so we have some reason to trust it)
(Get it? Convolutional Neural Network.) Models that lack interpretability can really suck.
When should I turn on the Machine Learning(tm)?
Can you write a generative model for the data (that evaluates in less than a Hubble time)?
Don’t use machine learning. Forward model the data.
Do you care about model interpretability, or interpreting the results that you get?
Don’t use machine learning. Forward model the data.
Do you want a posterior probability distribution over data?
Don’t use machine learning. Forward model the data.
Do you need to retain some semblance of probability over data?
Don’t use machine learning. Forward model the data.
Do you want to classify or estimate things, or make decisions, and you don’t care about the physics?
Hell yeah! Turn the Machine Learning up to 11!
Even when you turn on the Machine Learning(tm), the rules still apply!
From Google on ‘Scalable and accurate deep learning with electronic health records’ (Nature):
Regularised logistic regression performed essentially just as well as Deep Neural Networks (mortality C.I. 0.81-0.89 vs 0.94 to 0.96).
Huge cost, complexity, and interpretability difference in those models.
What is the simplest, dumbest model you can think of?
Start with that.
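A sketch of what "start with that" could look like for a classification problem: regularised logistic regression written with NumPy only, as a baseline to beat before reaching for anything deeper. The synthetic data, the hyperparameters, and the function names are assumptions for illustration, and this is unrelated to the data or models in the study quoted above.

```python
import numpy as np


def fit_logistic_regression(X, y, l2=1.0, learning_rate=0.5, n_iter=2000):
    """L2-regularised logistic regression fit by batch gradient descent."""
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    theta = np.zeros(d + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ theta))
        grad = A.T @ (p - y) / n
        grad[1:] += l2 * theta[1:] / n  # do not penalise the intercept
        theta -= learning_rate * grad
    return theta


def predict_proba(X, theta):
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-A @ theta))


# Synthetic two-feature classification problem with known coefficients.
rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 2))
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < predict_proba(X, true_theta)).astype(float)

# Hold out the last 200 objects to check the baseline on unseen data.
train, test = slice(0, 800), slice(800, None)
theta = fit_logistic_regression(X[train], y[train])
accuracy = np.mean((predict_proba(X[test], theta) > 0.5) == (y[test] == 1))
print(theta, accuracy)
```

The fitted coefficients can be read directly as the direction and strength of each feature's effect, which is the interpretability that a deeper model gives up.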
Standard tools for data analysis
Linear algebra: Go back to basics. Keep your linear (matrix) algebra sharp.
Python (3): astropy, numpy, scipy, scikit-learn, TensorFlow (not just for ML)
Positives: Good glue. Human-readable, machine-executable. Transferable skill.
Negatives: Only a little bit slow.
Stan: probabilistic programming language
When to use: If you have a model that doesn’t have bespoke parts (e.g., no models at grid points, or functions that are not differentiable).
When not to use: When your model contains bespoke parts. Or if statements (kinda).
Fortran/C: Betterise your code by speeding up the slowest parts. You can call Fortran or C functions directly from Python.
PostgreSQL: Learn it. Write scripts to ingest data. You will thank yourself later.
Hadoop: If you have a map-reduce job, use Hadoop. Transferable skill.
Resources
Statistics: Information theory, inference and learning algorithms, Sokal’s notes, Probabilistic Programming and Bayesian Methods for Hackers, Bayesian Data Analysis, Hamiltonian Monte Carlo
Version control: oh shit git
Machine Learning: Talking Machines, Which ML algorithm is for me?, Matrix calculus you need for deep learning, You should understand backpropagation, Machine Learning 101 (Google Engineers)
Code: astropy, tensorflow, stan, scikit-learn, fortran from python
Probabilistic graphical models: an introduction
Linear algebra: immersive linear algebra
Unsolicited advice to be ahead of the data wave
1. Create a GitHub or BitBucket account and use it. Push daily. Push good code. Push bad code. Push grant proposals. Push paper drafts. Push. Push. Push.
2. Read arXiv:1008.4686 and do all the exercises.
3. Be familiar with tools (machine learning, optimisation algorithms, linear algebra) and know how to choose the right tool. It’s hard.
4. Think about whether you can map-reduce your data analysis problem. If you can, learn Hadoop as part of that project.
5. Start with the simplest model for data analysis. But for fun, think about how to fit a line to one petabyte of data.
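One possible way into the exercise in item 5, offered as a sketch and not as the intended answer: weighted linear least squares only needs the sums A^T C^-1 A and A^T C^-1 y, and sums over data split naturally into a map step per chunk and an additive reduce step. The generator `generate_chunks` is a hypothetical stand-in for reading chunks from disk, and plain Python `map` and `reduce` stand in for a framework such as Hadoop.

```python
from functools import reduce

import numpy as np


def map_chunk(chunk):
    """Map step: compress one chunk of (x, y, sigma_y) to small statistics."""
    x, y, sigma_y = chunk
    A = np.vstack([np.ones_like(x), x]).T
    inv_var = 1.0 / sigma_y**2
    return A.T @ (A * inv_var[:, None]), A.T @ (y * inv_var)


def reduce_pair(left, right):
    """Reduce step: the statistics from different chunks simply add."""
    return left[0] + right[0], left[1] + right[1]


def generate_chunks(n_chunks, chunk_size, seed=7):
    """Hypothetical data source; in practice each chunk is read from storage."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        x = rng.uniform(0, 10, size=chunk_size)
        sigma_y = rng.uniform(0.5, 1.5, size=chunk_size)
        y = 3.0 + 0.25 * x + rng.normal(scale=sigma_y)
        yield x, y, sigma_y


# One million points in total, but only one chunk is in memory at a time.
ATCinvA, ATCinvy = reduce(reduce_pair, map(map_chunk, generate_chunks(50, 20_000)))
intercept, slope = np.linalg.solve(ATCinvA, ATCinvy)
print(intercept, slope)
```

Each chunk is reduced to a 2x2 matrix and a 2-vector, so the final solve costs the same whether the input is a megabyte or a petabyte.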
Gaia Sprints
Not traditional scientific meetings.
Aim is to bring together people who want to exploit Gaia data on short timescales.
We do everything in the open. Open data. Open science.
No invited participants; everyone applies to attend (incl. the SOC, the Gaia principal investigator, etc).
“Best scientific experience of my life”, “Most important week of my year”.
Next Sprint: 2019 Santa Barbara
gaia.lol
Conclusions
The data are only going to get bigger. Those who can’t swim will drown.
Those who can swim will drown in .
Remember to ask yourself:
1. Do you have small data, or do you have big data?
2. What is the simplest, dumbest model you can think of?
3. What assumptions are you making?
4. What is the utility of the model?
5. What can you afford?