Mining Massive Data Sets With CANFAR and Skytree Nicholas M. Ball Canadian Astronomy Data Centre National Research Council Victoria, BC, Canada
Page 1: Mining Massive Data Sets With CANFAR and Skytree › ai12 › pdfs › Ball_AI12.pdf · •CADC’s cloud computing system to provide a generic infrastructure for storage and processing

Mining Massive Data Sets With CANFAR and Skytree

Nicholas M. Ball
Canadian Astronomy Data Centre
National Research Council
Victoria, BC, Canada

Page 2

Collaborators

•David Schade (CADC)

•Alex Gray (Skytree and Georgia Tech)

•Martin Hack (Skytree)

... and many others

Page 3

Me

Data miner who does astronomy

Astronomer who does data mining

Page 4

Outline

•CANFAR

•Skytree

•Combining them: CANFAR+Skytree

•Using it

•Example Applications

•Conclusions

Page 5

CANFAR

•CADC’s cloud computing system to provide a generic infrastructure for storage and processing

•Processing: 500 cores, nodes with up to 6 processors and 32 GB of memory (soon 256 GB)

•Storage: VOSpace, several hundred terabytes available, mounted filesystem

•User sees a virtual machine, on which one can install and run any Linux software, e.g., almost all astronomy code

•Run code interactively or in batch, via Condor
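As an illustration of the batch route, here is a minimal Python sketch that writes a generic HTCondor submit description with one job per input file. The keywords used (`universe`, `executable`, `arguments`, `log`, `output`, `error`, `queue`) and the `$(Process)` macro are standard HTCondor, but the script name, the file names, and the `logs` directory are hypothetical, and any CANFAR-specific submit settings are not covered here.

```python
def condor_submit_description(executable, argument_sets, log_dir="logs"):
    """Build an HTCondor submit description that queues one job per argument string."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"log = {log_dir}/jobs.log",
        # $(Process) is expanded by HTCondor to the job index within the cluster.
        f"output = {log_dir}/job.$(Process).out",
        f"error = {log_dir}/job.$(Process).err",
        "",
    ]
    for arguments in argument_sets:
        # Each arguments/queue pair submits one job with those arguments.
        lines.append(f"arguments = {arguments}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


# Hypothetical example: one job per catalogue chunk.
chunks = [f"chunk_{i:03d}.txt" for i in range(3)]
submit_text = condor_submit_description("process_chunk.sh", chunks)
print(submit_text)
```

The resulting text would be saved to a file and passed to `condor_submit`.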

Page 6

Skytree

•7 well-known data mining algorithms (next slide)

•Fast implementations: N^2 -> N

•Robust, proven accuracy (FASTlab) -> publication-quality results

•Academic and astronomy background

•Runs from the command line as part of one’s analysis

•E.g., input ASCII data, output and visualize results

Page 7

Algorithm  Description                    Runtime         Notes
allkn      All nearest neighbors          O(N)            naive: N^2
kde        Kernel density estimation      O(N)            naive: N^2
svm        Support vector machine         Data-dependent
lr         Linear regression              Data-dependent
svd        Singular value decomposition   Data-dependent
kmeans     K-means clustering             Data-dependent
two_pt     2-point correlation function   O(N)            naive: N^2
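To make the naive-versus-fast distinction concrete, the sketch below computes all nearest neighbours twice in pure Python: once by brute force (N^2 distance evaluations) and once with a simple k-d tree. This is not Skytree's implementation, which is not described in these slides; the k-d tree is a stand-in chosen for illustration, and a per-point tree search like this one typically costs about N log N in total rather than N. The random points are synthetic.

```python
import math
import random


def naive_allnn(points):
    """Brute force: compare every point with every other point (N^2 distances)."""
    result = []
    for i, p in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        result.append((best_j, best_d))
    return result


def build_kdtree(indices, points, depth=0):
    """Split the point indices at the median, cycling through the axes."""
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(indices[:mid], points, depth + 1),
        "right": build_kdtree(indices[mid + 1:], points, depth + 1),
    }


def nearest(node, points, query_index, best=(-1, math.inf)):
    """Return (index, distance) of the nearest neighbour, excluding the point itself."""
    if node is None:
        return best
    q = points[query_index]
    i, axis = node["index"], node["axis"]
    if i != query_index:
        d = math.dist(q, points[i])
        if d < best[1]:
            best = (i, d)
    diff = q[axis] - points[i][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query_index, best)
    # Search the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[1]:
        best = nearest(far, points, query_index, best)
    return best


def kdtree_allnn(points):
    root = build_kdtree(list(range(len(points))), points)
    return [nearest(root, points, i) for i in range(len(points))]


random.seed(1)
points = [(random.random(), random.random(), random.random()) for _ in range(300)]
naive = naive_allnn(points)
tree = kdtree_allnn(points)
print(all(math.isclose(a[1], b[1]) for a, b in zip(naive, tree)))
```

Both routines return the same nearest-neighbour distances; only the number of distance evaluations differs, which is what matters at catalogue sizes of millions of objects.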

Page 8

CANFAR+Skytree

•Powerful system: Skytree on up to 500 cores in parallel

•Install on any VM, alongside your own code

•Access to VOSpace storage as mounted filesystem

•Analogy to CANFAR itself: generic infrastructure to facilitate science

•Extends CANFAR’s capability to enable data mining

Page 9

How to Use CANFAR+Skytree

•Request a CANFAR account

•Register for a CADC account

•ssh to CANFAR, start a virtual machine

•Install Skytree on the virtual machine

•Add license server to path

•Run Skytree (next slide)

> Email [email protected]
> Website login + password
> ssh canfar.dao.nrc.ca
> vmcreate <myvm>
> vmssh <myvm>
> tar -zxf <tarball>   (tarball from http://www.skytreecorp.com)
> export SKYTREE_LICENSE_PATH=@login-server-ip-address:/home/username/.skytree/skytree-client.lic

Page 10

Running Skytree

•A typical Skytree call looks like this:

> ./skytree-server allkn \
    --references_in=datasets/sdss100kx4.skytree \
    --k_neighbors=1 \
    --distances_out=distances.out \
    --indices_out=indices.out

•Run interactively, or in batch (up to 500 cores), as part of your data processing
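When the call is made from a processing script rather than typed by hand, the argument list can be assembled programmatically. The sketch below is a hypothetical Python helper (`allkn_command` is not part of Skytree) that uses only the flags shown in the call above; it builds the command and prints it without running anything.

```python
import shlex


def allkn_command(references_in, k_neighbors=1,
                  distances_out="distances.out", indices_out="indices.out",
                  binary="./skytree-server"):
    """Assemble the argument list for an allkn call, using only the flags shown on the slide."""
    return [
        binary, "allkn",
        f"--references_in={references_in}",
        f"--k_neighbors={k_neighbors}",
        f"--distances_out={distances_out}",
        f"--indices_out={indices_out}",
    ]


cmd = allkn_command("datasets/sdss100kx4.skytree")
print(shlex.join(cmd))
```

On a machine where Skytree is installed and licensed, the list could be passed to `subprocess.run(cmd, check=True)`.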

Page 11

Consultation

•The aim of the system is to enable better science

•If you have a problem to solve, send us an email, and we’ll work with you

•My background is astronomy, but I also know data mining

Page 12

Example: Photo-zs for CFHTLS

•My own science interest is the galaxy luminosity function

•But the photo-zs also have legacy value

•Skytree allkn allows generation of full redshift probability density functions (PDFs) by perturbing the inputs (Ball et al. 2008), and CANFAR allows comparison to template-based photo-zs (Le Phare)

•Done for ~130 million CFHTLS galaxies (~26 million x 5 detection passbands)

•100 perturbations: create and handle a catalogue of ~13 billion objects

•-> CANFAR+Skytree can process LSST-sized datasets
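The perturbation idea can be sketched in a few lines of Python. This is a toy illustration, not the pipeline used for CFHTLS: the training set is synthetic with no physical meaning, the nearest-neighbour search is brute force rather than allkn, and the Gaussian perturbation of each magnitude by its quoted error is an assumption about how the inputs are perturbed. The last lines reproduce the catalogue-size arithmetic from the bullets above.

```python
import math
import random


def nearest_redshift(mags, training):
    """Redshift of the training object closest to `mags` in magnitude space (brute force)."""
    return min(training, key=lambda obj: math.dist(mags, obj[0]))[1]


def photoz_samples(mags, errors, training, n_perturb=100, seed=0):
    """Perturb the magnitudes n_perturb times and collect the nearest-neighbour redshifts."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_perturb):
        # Assumption: each magnitude is perturbed by Gaussian noise scaled by its error.
        perturbed = [rng.gauss(m, e) for m, e in zip(mags, errors)]
        samples.append(nearest_redshift(perturbed, training))
    return samples


# Synthetic training set of (magnitudes, redshift) pairs; toy relation, not real data.
rng = random.Random(42)
training = []
for _ in range(500):
    z = rng.uniform(0.0, 1.5)
    mags = [20.0 + 2.0 * z + 0.3 * band for band in range(5)]
    training.append((mags, z))

samples = photoz_samples([21.6, 21.9, 22.2, 22.5, 22.8], [0.05] * 5, training)
print(len(samples), min(samples), max(samples))

# Catalogue size quoted on the slide: ~130 million galaxies x 100 perturbations.
n_galaxies = 130_000_000
n_perturbations = 100
print(n_galaxies * n_perturbations)
```

The spread of the collected redshifts is what forms the PDF for each galaxy, and multiplying every object by 100 perturbations is what pushes the catalogue to ~13 billion rows.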

Page 13

[Figure: PDF (y-axis) versus photometric redshift (x-axis, 0.7 to 1.1): allkn photo-z instances fitted with kde]
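Fitting a kernel density estimate to a set of photo-z instances can be sketched as follows. This is a plain Gaussian KDE evaluated naively (every grid point against every sample), not Skytree's kde; the sample values and the bandwidth are hypothetical and chosen only to fall in the redshift range of the figure.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical photo-z instances for one galaxy (illustrative values only).
samples = [0.84, 0.86, 0.88, 0.89, 0.90, 0.90, 0.91, 0.93, 0.96]
pdf = gaussian_kde(samples, bandwidth=0.02)

# Check the normalisation with a simple Riemann sum over a wide grid.
step = 0.001
grid = [0.5 + step * i for i in range(801)]
area = sum(pdf(x) for x in grid) * step
print(round(area, 3))
```

The bandwidth controls how smooth the resulting curve is; in practice it would be chosen from the data rather than fixed by hand.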

Page 14

Example: Skytree Scaling

•Show that Skytree scales as claimed on real astronomy data

•Do we really get N^2 -> N?

•Compare to an open-source alternative, R

•Run algorithms on large catalogues: 100 million objects or greater

•E.g., 2MASS, CFHTLS, WISE, etc.

•Do useful investigations, e.g., find outliers
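One way to test a scaling claim is to time the same algorithm at several catalogue sizes and fit the slope of log(runtime) against log(N): a slope near 2 indicates N^2 behaviour and a slope near 1 indicates linear behaviour. The Python sketch below shows the fit; the timings are generated from exact power laws purely to exercise the function and are not measurements of Skytree or R.

```python
import math


def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Synthetic timings (not measurements): exact N^2 and N power laws.
sizes = [10**4, 10**5, 10**6, 10**7]
quadratic = [1e-9 * n**2 for n in sizes]
linear = [1e-6 * n for n in sizes]

print(round(scaling_exponent(sizes, quadratic), 3))
print(round(scaling_exponent(sizes, linear), 3))
```

With real timings the fitted slope will be noisy, so several sizes spanning a few orders of magnitude give a more trustworthy estimate.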

Page 15

Work in progress

Page 16

Work in progress

Page 17

Conclusions

•CANFAR: storage, processing, analysis, generic, with own code

•Skytree: fast, robust -> publication-quality results

•CANFAR+Skytree: Skytree on up to 500 cores, combined with your own code, with access to VOSpace

•To get started, email [email protected] (or talk to me!)

•For more information: Poster, or https://sites.google.com/site/nickballastronomer

We encourage interested users!

