Date posted: 13-Jul-2015
Uploaded by: mongodb
About Us
We are Computer Scientists and Engineers at Lawrence Berkeley Lab
We work with science teams to help build software and computing infrastructure for doing awesome SCIENCE
Talk overview
Background: community science and data
science in general, materials in particular
How (and why) we use MongoDB today
Things we would like to do with MongoDB in the future
Conclusions
Big Data
Science is increasingly data-driven
Computational cycles are cheap
Take an "-omics" approach to science
Compute all interesting things first, ask questions later
The Materials Project
An open science initiative that makes available a huge database of computed materials properties for all materials researchers
Why we care about "materials"
solar PV, electric vehicles
other: waste heat recovery (thermoelectrics), hydrogen storage, catalysts/fuel cells
This is a Material!
https://www.materialsproject.org/materials/24972/
{
  "created_at": "2012-08-30T02:55:49.139558",
  "version": {"pymatgen": "2.2.1dev", "db": "2012.07.09", "rest": "1.0"},
  "valid_response": true,
  "copyright": "Copyright 2012, The Materials Project",
  "response": [{
    "formation_energy_per_atom": -1.7873700829440011,
    "elements": ["Fe", "O"],
    "band_gap": null,
    "e_above_hull": 0.0460096124999998,
    "nelements": 2,
    "pretty_formula": "Fe2O3",
    "energy": -132.33005625,
    "is_hubbard": true,
    "nsites": 20,
    "material_id": 542309,
    "unit_cell_formula": {"Fe": 8.0, "O": 12.0},
    ....
High-throughput computing is like an army
MATERIALS THEORY + COMPUTERS + WORKFLOW
vs. the time-dependent Schrödinger equation:
$i\hbar \frac{d\Psi(\{r_i\}; t)}{dt} = \hat{H}\,\Psi(\{r_i\}; t)$
Do not synthesize!
Put on backburner
Begin further investigation
Infrastructure
Submitted Materials
Materials Data
Materials Properties
Supercomputers
• Over 10 million CPU hours of calculations in < 6 months
• Over 40,000 successful VASP runs (30,000+ materials)
• Generalizable to other high-throughput codes/analyses
Calculation Workflows
Supercomputers
Codes to run (in sequence)
Atomic positions
Application Stack
Python Django Web Service
Pymatgen and other scientific Python libraries
Fireworks + VASP
pymongo
MongoDB under all these layers
Currently at Mongo 2.2
Powered by MongoDB
Materials Project data stored in a MongoDB
Core materials properties
User generated data
Workflow and Job data
Scalability
We use replica sets in a master/slave config
No sharding (yet)
But we are doing pretty well with a small MongoDB cluster so far: 4 nodes (2 prod, 2 dev), 128 GB RAM and 24 cores per node
Current data volume: 30,000 compounds – 250 GB
Eventually: millions of compounds, big experimental data
Flexibility
Our data changes frequently
it's research
This was a major pain point for the old SQL schema
The flip side, chaos, has not been a big problem
in reality, all access is programmatic so the code implies the schema
Developer-friendly
We work in small groups: a mix of scientists and programmers
need something easy to learn, easy to use
Structuring the data is intuitive
Querying the data is intuitive
Querying the data is easy to map to web interfaces
Developed a special language, "Moogle", that translates pretty cleanly to MongoDB filters
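The slides do not show Moogle's actual grammar, so the following is a purely hypothetical sketch of the kind of translation involved: simple (field, operator, value) triples, as a web form might collect, mapped onto MongoDB filter operators.

```python
# Hypothetical sketch only: "Moogle"'s real syntax is not shown in the
# slides. This maps (field, op, value) triples onto MongoDB filter dicts.

OPS = {"=": None, ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}

def to_mongo_filter(triples):
    """Turn (field, op, value) triples into a MongoDB filter dict."""
    query = {}
    for field, op, value in triples:
        mongo_op = OPS[op]
        if mongo_op is None:           # equality matches the value directly
            query[field] = value
        else:                          # range operators nest under the field
            query.setdefault(field, {})[mongo_op] = value
    return query

print(to_mongo_filter([("nelements", ">", 3), ("nelements", "<", 5),
                       ("pretty_formula", "=", "Fe2O3")]))
```

The resulting dict can be passed straight to a pymongo `find()` call, which is what makes this style of translation pay off.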
Search as JSON
{
"nelements": {"$gt": 3, "$lt": 5},
"elements": {
"$all": ["Li", "Fe", "O"],
"$nin": ["Na"]
}
}
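To illustrate what the operators in this query mean, here is a minimal pure-Python evaluator covering just the four operators used above; a real query would of course go through pymongo's `collection.find()`.

```python
# Minimal, illustrative evaluator for $gt, $lt, $all, $nin only.
# Not a MongoDB implementation; it just demonstrates operator semantics.

def matches(doc, query):
    for field, cond in query.items():
        value = doc.get(field)
        if not isinstance(cond, dict):           # plain equality match
            if value != cond:
                return False
            continue
        for op, arg in cond.items():
            if op == "$gt" and not value > arg:
                return False
            elif op == "$lt" and not value < arg:
                return False
            elif op == "$all" and not all(x in value for x in arg):
                return False
            elif op == "$nin" and any(x in arg for x in value):
                return False
    return True

query = {
    "nelements": {"$gt": 3, "$lt": 5},
    "elements": {"$all": ["Li", "Fe", "O"], "$nin": ["Na"]},
}
doc = {"nelements": 4, "elements": ["Li", "Fe", "P", "O"]}
print(matches(doc, query))   # True: 3 < 4 < 5, contains Li/Fe/O, no Na
```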
JSON
JSON is easy (enough) for scientists to read and understand
Direct translation between JSON and the database saves lots of time
Also a nearly-direct mapping to Python dicts
For example: Structured Notation Language, the format for describing the inputs of our computational jobs
Makes it very easy to build an HTTP API on our data: the Materials API
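The JSON-to-dict round trip is one line each way in Python's standard library; for instance, with a fragment of the Fe2O3 record shown earlier:

```python
import json

# JSON <-> Python dict round trip: a record fragment parses straight
# into native types (null becomes None), and serializes back out.
record = '{"pretty_formula": "Fe2O3", "nelements": 2, "band_gap": null}'
doc = json.loads(record)            # JSON -> dict
print(doc["pretty_formula"])        # Fe2O3
print(json.dumps(doc, sort_keys=True))  # dict -> JSON
```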
Read/Write Patterns
Scientific data usually generated in large runs
Must go through validation first
Core data only updated in bulk during controlled releases
Core data is essentially read-only
User-generated data is r/w but writes need not complete immediately
Workflow data is r/w and has some write issues, since timing matters
Our use of MongoDB
Use it directly; no MongoMapper or other ORM-like layer
Did write a "QueryEngine" class
specific to our data, not a general ORM
Sample collections in DB
Collection             Record size   Count
diffraction_patterns   450 KB         38,621
materials              400 KB         30,758
tasks                  150 KB         71,463
crystals                40 KB        125,966
Building materials from tasks
if ndocs == 1:
    # Only one successful run; could be GGA or GGA+U
    blessed_doc = docs[0]
    # Add task_ids and ntask_ids
    blessed_doc['task_ids'] = [blessed_doc['task_id']]
    blessed_doc['ntask_ids'] = 1
elif ndocs >= 2:
    # Multiple GGA and GGA+U runs:
    # select the GGA+U run if it exists,
    # else sort by free_energy_per_atom
    ....
Select the "blessed" material based on domain criteria
Data Examples – Fe2O3
https://www.materialsproject.org/materials/24972/ (the same JSON document shown on the "This is a Material!" slide earlier)
Mongo works for us
Adding new properties is common, so we need a flexible schema
Data is nested and structured: more object-like, less relational
Need a flexible query mechanism to be able to pull out relevant objects
Read-heavy, few writes: core data only updated in bulk during releases
User writes less critical, so we can survive hiccups
Schema-Last
Mongo provides flexible schemas
But web code expects data in a certain structure
A map/reduce (M/R) methodology allows us to distill the data into the schema expected by the code
Allows us to evolve the schema while maintaining a more rigorous structure for the frontend
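The distillation step can be sketched in plain Python: group raw task documents by formula, then emit one materials-shaped document per group, preferring the GGA+U run as described on the earlier "blessed material" slide. Field names here follow the Fe2O3 record; the energies and task ids are made-up illustrative data.

```python
from collections import defaultdict

# Illustrative task documents (field names from the Fe2O3 record shown
# earlier; values are made up for the example).
tasks = [
    {"task_id": 1, "pretty_formula": "Fe2O3",   "is_hubbard": True,  "energy": -132.3},
    {"task_id": 2, "pretty_formula": "Fe2O3",   "is_hubbard": False, "energy": -120.1},
    {"task_id": 3, "pretty_formula": "LiFePO4", "is_hubbard": False, "energy": -191.0},
]

grouped = defaultdict(list)
for t in tasks:                        # "map": key each task by its formula
    grouped[t["pretty_formula"]].append(t)

materials = []
for formula, docs in grouped.items():  # "reduce": one material per formula
    blessed = max(docs, key=lambda d: d["is_hubbard"])  # prefer the GGA+U run
    materials.append({"pretty_formula": formula,
                      "task_ids": [d["task_id"] for d in docs],
                      "ntask_ids": len(docs),
                      "energy": blessed["energy"]})
```

In production this pass runs over the tasks collection and writes into the materials collection during a controlled release.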
Fireworks - workflows
Keep all our state in MongoDB:
raw inputs (crystals) for the calculations
job specifications, ready and running
output of finished runs
Outputs are loaded in batches back to MongoDB
Real-time updates for workflow status
MongoDB "write concerns" are important here
Lots of stuff needs to be validated
Atomic compound is stable
Volume of lattice is positive
Time to compute was greater than some minimum
"Phase diagram" correct
Rough agreement across calculation types
Energies agree with experimental results
.....
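A few of these checks can be written as a predicate over a materials document. This is a hypothetical sketch: the `volume` and `run_time` fields and the 60-second threshold are illustrative assumptions, not the project's actual validation code.

```python
# Hypothetical validation predicate. Field names follow the Fe2O3 record
# where possible; volume, run_time, and min_seconds are assumptions.
def validate(doc, min_seconds=60):
    """Return a list of human-readable failure messages (empty = valid)."""
    failures = []
    if doc.get("e_above_hull", 0.0) < 0:
        failures.append("negative energy above hull")
    if doc.get("volume", 0.0) <= 0:
        failures.append("lattice volume must be positive")
    if doc.get("run_time", 0.0) < min_seconds:
        failures.append("suspiciously short calculation")
    return failures

doc = {"e_above_hull": 0.046, "volume": -3.1, "run_time": 4800.0}
print(validate(doc))   # ['lattice volume must be positive']
```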
Validation requirements
Fast enough to use in real-time
But not full MongoDB query syntax
too finicky, tricky esp. for "or" type of queries
Want other people to be able to use this
We don't want to become a "mongodb tutor"
Also is nice to be general to any document-oriented datastore
Our validation workflow
The user writes a YAML constraint specification. From it we build a MongoDB query and run it (coll.find()) against the records, then determine which constraints failed for each record and emit a report. (Someday the constraint specifications could also come from ML.)
Example query
_aliases:
  - energy = final_energy_per_atom
materials_dbv2:
  - filter:
      - nelements = 4
    constraints:
      - energy > 0
      - spacegroup.number > 0
      - elements size$ nelements
$ mgvv -f mgvv-ex.yaml

Generated MongoDB query:
{'nelements': 4, '$or': [
    {'spacegroup.number': {'$lte': 0}},
    {'final_energy_per_atom': {'$lte': 0}},
    {'$where': 'this.elements.length != this.nelements'}
]}
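The translation can be sketched as: resolve aliases, negate each constraint, and collect the negations under $or, so that a record violating any constraint matches the failure-finding query. This hypothetical snippet covers only the comparison operators from the example above (not the size$/$where clause).

```python
# Hypothetical sketch of the constraint-to-query step. Only the two
# comparison operators from the example are handled; failing a
# constraint means matching its negation.
NEGATE = {">": "$lte", "<": "$gte"}

def failure_query(filter_clauses, constraints, aliases=None):
    """Build a query matching records that violate any constraint."""
    aliases = aliases or {}
    query = dict(filter_clauses)           # e.g. {'nelements': 4}
    ors = []
    for field, op, value in constraints:
        field = aliases.get(field, field)  # energy -> final_energy_per_atom
        ors.append({field: {NEGATE[op]: value}})
    query["$or"] = ors
    return query

q = failure_query({"nelements": 4},
                  [("energy", ">", 0), ("spacegroup.number", ">", 0)],
                  aliases={"energy": "final_energy_per_atom"})
print(q)
```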
Sandbox implementation
We pre-build the sandboxes the same way we pre-build all the other derived collections
We are using collections with "." to separate components: core.<collection>, sandbox.<name>.<collection>
All access is mediated by our Django server; permissions stored in Django map users and groups to sandbox access
No cross-collection querying is needed
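A hypothetical helper makes the dotted-namespace convention concrete; in the real system the permission check lives in Django, and the helper names here are illustrative.

```python
# Hypothetical helpers for the dotted collection-name convention.
# In production, permissions come from Django's user/group tables.

def collection_name(collection, sandbox=None):
    """core.<collection>, or sandbox.<name>.<collection> if sandboxed."""
    if sandbox:
        return f"sandbox.{sandbox}.{collection}"
    return f"core.{collection}"

def resolve(user_sandboxes, collection, sandbox=None):
    """Gate sandbox access, then return the collection name to query."""
    if sandbox is not None and sandbox not in user_sandboxes:
        raise PermissionError(f"no access to sandbox {sandbox!r}")
    return collection_name(collection, sandbox)

print(resolve({"batteries"}, "materials"))               # core.materials
print(resolve({"batteries"}, "materials", "batteries"))  # sandbox.batteries.materials
```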
Share data across disciplines
Use MongoDB as a flexible metadata store that connects data/metadata
Store and search user-generated ontologies
Organize and search PBs of data files
Currently store in file hierarchies
/projectX/experimentA/tuesday/run99/foobar.hdf5
Move the metadata to MongoDB
{'project':'X', 'experiment':'A', 'run':99,
'date':'2013-05-10T12:31:56', 'user':'joe', ....
'path':'/projectX/d67001A3.hdf5'}
Already doing this for some projects
one-offs: general layer possible?
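One way such a one-off looks in practice: parse the path convention shown above into a metadata document ready for insertion. This is a hypothetical sketch; the regex encodes only the example layout.

```python
import re

# Hypothetical sketch: extract a MongoDB-ready metadata document from
# the hierarchical path convention shown above. The pattern matches
# only the example layout /project<X>/experiment<A>/<day>/run<N>/<file>.
PATTERN = re.compile(
    r"/project(?P<project>\w+)/experiment(?P<experiment>\w+)"
    r"/(?P<day>\w+)/run(?P<run>\d+)/(?P<name>[^/]+)$")

def path_to_metadata(path):
    """Return a metadata dict for a conforming path, else None."""
    m = PATTERN.match(path)
    if not m:
        return None
    meta = m.groupdict()
    meta["run"] = int(meta["run"])   # store the run number as an int
    meta["path"] = path              # keep the original file location
    return meta

print(path_to_metadata("/projectX/experimentA/tuesday/run99/foobar.hdf5"))
```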
Thank you
Contact info
Shreyas Cholia [email protected]
Dan Gunter [email protected]
Thanks to these people for slide material
Anubhav Jain, Kristin Persson, David Skinner (LBNL)
Thanks to Materials Project contributors
http://materialsproject.org/contributors