Introducing: MongoDB
David J. C. Beach
Sunday, August 1, 2010
David Beach
Software Consultant (past 6 years)
Python since v1.4 (late 90’s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a “frameworks” guy
Organizer: Front Range Pythoneers
Outline
Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features
Part I: Trends in Databases
Database Trends
Past: “Relational” (RDBMS)
Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign keys
Data is controlled & queried via SQL
WARNING: extreme oversimplification
Trends: Criticisms of RDBMS
Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages
Lots of disagreement over this
There are points & counterpoints from both sides
The debate is not over
Not here to deliver a verdict
POINT: This is why we see an explosion of new databases.
Trends: Fragmentation
Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)
categories are incomplete
some don’t fit neatly into categories
As with so many things in technology, we’re seeing... FRAGMENTATION!
some examples of DB categories
Where Mongo Fits
“The Best Features of Document Databases, Key-Value Stores, and RDBMSes.”
Mongo’s Tagline (taken from website)
What is Mongo
Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0
Mongo Advantages
json-style documents (dynamic schemas)
flexible indexing (B-Tree)
replication and high-availability (HA)
automatic sharding support (v1.6)*
easy-to-use API
fast queries (auto-tuning planner)
fast insert & deletes (sometimes trade-offs)
* sharding support available as of v1.6 (late July 2010)
many of these taken straight from home page
Mongo Language Bindings
C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)
Mongo Disadvantages
No Relational Model / SQL
No Explicit Transactions / ACID
Limited Query API
You can do a lot more with MapReduce and JavaScript!
Operations are atomic only within a single document. (Generally)
Can mimic with foreign IDs, but referential integrity not enforced.
When to use Mongo
Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations
My personal take on this...
Caveat: I’ve never used Mongo in Production!
Part II: Mongo Basic Usage
BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)
Installing Mongo
Use a 64-bit OS (Linux, Mac, Windows)
Get Binaries: www.mongodb.org
Run “mongod” process
32-bit available; not for production
MongoDB uses memory-mapped files.
A 32-bit build limits the database to 2 GB!
Installing PyMongo
Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools
(includes C extension for speed)
# python setup.py install
# python setup.py --no-ext install
(to compile without extension)
Mongo Anatomy
Database
Collection
Document
Mongo Server
>>> import pymongo
>>> connection = pymongo.Connection("localhost")
Getting a Connection
Connection required for using Mongo
>>> db = connection.mydatabase
Finding a Database
Databases = logically separate stores
Navigation using properties
Will create DB if not found
>>> blog = db.blog
Using a Collection
Collection is analogous to Table
Contains documents
Will create collection if not found
>>> entry1 = {"title": "Mongo Tutorial", "body": "Here's a document to insert."}
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')
Inserting
collection.insert(document) => document_id
document
>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'), 'body': "Here's a document to insert.", 'title': 'Mongo Tutorial'}
Inserting (contd.)
Documents must have ‘_id’ field
Automatically generated unless assigned
12-byte unique binary value
You can also assign your own ‘_id’; it can be any unique value.
Mongo’s IDs are designed to be unique...
...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.
ID generated by driver. No waiting on DB.
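To make the 12-byte ID concrete, here is a minimal pure-Python sketch of how a driver could build one without a round trip to the server. It assumes the layout MongoDB documented at the time (4-byte timestamp, 3-byte machine identifier, 2-byte process ID, 3-byte counter). `make_object_id` is a hypothetical helper written for illustration, not PyMongo's API.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Hypothetical helper for illustration only; PyMongo ships its own ObjectId.
_counter = itertools.count(0)

def make_object_id(now=None):
    """Return 12 bytes: 4-byte time + 3-byte machine + 2-byte pid + 3-byte counter."""
    timestamp = struct.pack(">I", int(time.time() if now is None else now))
    # Any stable hash of the host name works for this sketch.
    machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]
    pid = struct.pack(">H", os.getpid() % 0xFFFF)
    counter = struct.pack(">I", next(_counter) % 0xFFFFFF)[1:]  # keep low 3 bytes
    return timestamp + machine + pid + counter

first = make_object_id()
second = make_object_id()
print(first.hex())
print(second.hex())
```

Because every part is computed on the client, two IDs made in the same second on the same machine still differ in the counter bytes.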
>>> entry2 = {"title": "Another Post", "body": "Mongo is powerful", "author": "David", "tags": ["Mongo", "Power"]}
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')
Inserting (contd.)
Documents may have different properties
Properties may be atomic, lists, dictionaries
another document
>>> blog.ensure_index("author")
>>> blog.ensure_index("tags")
Indexing
May create index on any field
If field is list => index associates all values
index by single value
by multiple values
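The idea that a list field is indexed under each of its values can be sketched in plain Python. This is only a conceptual model: it uses a dictionary of sets, whereas Mongo's indexes are B-Trees, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

def build_index(documents, field):
    """Map each indexed value to the set of ids of documents holding it."""
    index = defaultdict(set)
    for doc in documents:
        if field not in doc:
            continue
        value = doc[field]
        # A list field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc["_id"])
    return index

docs = [
    {"_id": 1, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 2, "author": "Robot", "tags": ["bulk", "Power"]},
    {"_id": 3, "title": "No author or tags"},
]
by_author = build_index(docs, "author")   # index by single value
by_tag = build_index(docs, "tags")        # index by multiple values
print(sorted(by_tag["Power"]))
```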
import random

bulk_entries = [ ]
for i in range(100000):
    entry = { "title": "Bulk Entry #%i" % (i+1),
              "body": "What Content!",
              "author": random.choice(["David", "Robot"]),
              "tags": ["bulk", random.choice(["Red", "Blue", "Green"])] }
    bulk_entries.append(entry)
Bulk Insert
Let’s produce 100,000 fake posts
>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]
Bulk Insert (contd.)
collection.insert(list_of_documents)
Inserts 100,000 entries into blog
Returns in 2.11 seconds
>>> blog.remove() # clear everything
>>> blog.insert(bulk_entries, safe=True)
Bulk Insert (contd.)
returns in 7.90 seconds (vs. 2.11 seconds)
driver returns early; DB is still working
...unless you specify “safe=True”
>>> blog.find_one({"title": "Bulk Entry #12253"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #12253'}
Querying
collection.find_one(spec) => document
spec = document of query parameters
returned in 0.04s - extremely fast
No index created for “title”!
presumably, need more entries to effectively test index performance...
>>> blog.find_one({"title": "Bulk Entry #12253", "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #12253'}
Querying (Specs)
Multiple conditions on document => “AND”
Value for tags is an “ANY” match
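The "AND" and "ANY" rules can be modeled with a few lines of plain Python. This is a sketch of the matching semantics only, not of how the server evaluates queries; `matches` is a hypothetical function.

```python
def matches(doc, spec):
    """Return True if doc satisfies every key/value condition in spec."""
    for key, wanted in spec.items():
        if key not in doc:
            return False
        actual = doc[key]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # "ANY" match: one list element equal to the wanted value suffices.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True  # every condition held, which is the implicit "AND"

post = {"title": "Bulk Entry #12253", "author": "Robot", "tags": ["bulk", "Green"]}
green = matches(post, {"title": "Bulk Entry #12253", "tags": "Green"})
red = matches(post, {"title": "Bulk Entry #12253", "tags": "Red"})
print(green, red)
```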
>>> green_items = [ ]
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)
- or -
>>> green_items = list(blog.find({"tags": "Green"}))
Querying (Multiple)
collection.find(spec) => cursor
new items are fetched in bulk (behind the scenes)
>>> blog.find({"tags": "Green"}).count()
16646
Querying (Counting)
Use the find() method + count()
Returns number of matches found
>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)
Updating
collection.update(spec, document)
updates single document matching spec
“multi=True” => updates all matching docs
>>> blog.remove({"author":"Robot"}, safe=True)
Deleting
use remove(...)
it works like find(...)
Example removed approximately 50% of records.
Took 2.48 seconds
Part III: Advanced Features
Advanced Querying
Regular Expressions
{"tags": re.compile(r"^(Green|Blue)$")}
Nested Values
{"foo.bar.x": 3}
$where Clause (JavaScript)
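One detail worth checking when writing regular-expression queries: in a pattern such as `^Green|Blue$`, alternation binds more loosely than the anchors, so it reads as "starts with Green, or ends with Blue". A short sketch with Python's `re` module (the sample tag values are made up for illustration) shows the difference once the alternatives are grouped:

```python
import re

loose = re.compile(r"^Green|Blue$")      # (^Green) or (Blue$)
strict = re.compile(r"^(Green|Blue)$")   # exactly "Green" or exactly "Blue"

samples = ["Green", "Blue", "Greenish", "Navy Blue"]
results = {}
for tag in samples:
    results[tag] = (bool(loose.search(tag)), bool(strict.search(tag)))
    print(tag, results[tag])
```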
>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
Advanced Querying
$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch
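To show how operator documents combine, here is a small pure-Python evaluator for a subset of the operators listed above ($lt, $gt, $lte, $gte, $ne, $in, $or). It is a simplified model of the semantics, not the server's implementation: for example, it treats a missing field as a non-match even for $ne. The `views` field and the function names are hypothetical.

```python
import operator

# Comparison operators covered by this sketch (a subset of the list above).
COMPARE = {
    "$lt": operator.lt, "$gt": operator.gt,
    "$lte": operator.le, "$gte": operator.ge,
    "$ne": operator.ne,
}

def field_matches(actual, condition):
    """Check one field value against a plain value or an operator document."""
    if isinstance(condition, dict):
        for op, operand in condition.items():
            if op == "$in":
                candidates = actual if isinstance(actual, list) else [actual]
                if not any(c in operand for c in candidates):
                    return False
            elif op in COMPARE:
                if not COMPARE[op](actual, operand):
                    return False
            else:
                raise ValueError("operator not covered by this sketch: %s" % op)
        return True
    if isinstance(actual, list):
        return condition in actual
    return actual == condition

def matches(doc, spec):
    """Evaluate a spec: implicit AND across keys, plus a top-level $or."""
    for key, condition in spec.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif key not in doc or not field_matches(doc[key], condition):
            return False
    return True

post = {"title": "Another Post", "tags": ["Mongo", "Power"], "views": 12}
spec = {"$or": [{"tags": "Green"}, {"tags": "Power"}], "views": {"$gte": 10}}
print(matches(post, spec))
```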
>>> blog.find().limit(50)                  # find 50 articles
>>> blog.find().sort("title").limit(30)    # 30 titles
>>> blog.find().distinct("author")         # unique author names
Advanced Querying
collection.find(...)
sort(“name”) - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL’s DISTINCT]
collection.group(...) - like SQL’s GROUP BY
won’t be showing detailed examples of all these...
there are good tutorials online for all of this
let’s move on to something even more interesting
Map/Reduce
collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
Most powerful querying mechanism
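Before looking at the Mongo API, the map/reduce pattern itself can be sketched in a few lines of single-process Python. This is a conceptual model only: the mapper emits (key, value) pairs, values are grouped by key, and the reducer folds each group. The function names and sample posts are hypothetical.

```python
from collections import defaultdict

def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):   # mapper yields (key, value) pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def count_tags(doc):
    """Emit (tag, 1) for every tag on a post."""
    for tag in doc.get("tags", []):
        yield tag, 1

def sum_values(key, values):
    """Add up all values emitted for one key."""
    return sum(values)

posts = [
    {"title": "A", "tags": ["bulk", "Red"]},
    {"title": "B", "tags": ["bulk", "Green"]},
    {"title": "C", "tags": ["bulk", "Green"]},
]
tag_counts = map_reduce(posts, count_tags, sum_values)
print(tag_counts)
```

In a distributed setting the reducer may be applied to partial results and then again to its own output, so it should return a value of the same shape as the values it receives, as `sum_values` does.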
Map/Reduce Visualized
[Diagram: MapReduce data flow]
Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books
Chapter 2, page 20
also see: Map/Reduce: A Visual Explanation
mySQL
SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8
MongoDB
db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() {
    emit(
      { d1: this.Dim1, d2: this.Dim2 },
      { msum: this.measure1, recs: 1, mmin: this.measure1,
        mmax: this.measure2 < 100 ? this.measure2 : 0 }
    );
  },
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);
Chart annotations:
1. Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
2. Measures must be manually aggregated.
3. Aggregates depending on record counts must wait until finalization.
4. Measures can use procedural logic.
5. Filters have an ORM/ActiveRecord-looking style.
6. Aggregate filtering must be applied to the result set, not in the map/reduce.
7. Ascending: 1; Descending: -1
Revision 4, Created 2010-03-06
Rick Osborne, rickosborne.org
http://rickosborne.org/download/SQL-to-MongoDB.pdf
Map/Reduce Examples
This is me, playing with Map/Reduce
Health Clinic Example
Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times
Health Clinic Example
person = { "name": "Bob",
           "weighings": [
               {"date": date(2009, 1, 15), "weight": 165.0},
               {"date": date(2009, 2, 12), "weight": 163.2},
               ... ]
}
import datetime
import random

N = 1000  # number of people to generate; assumed value, not given on the slide
all_people = [ ]

for i in range(N):
    person = { 'name': 'person%04i' % i }
    weighings = person['weighings'] = [ ]
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({ 'date': date, 'weight': weight })
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)
Map/Reduce: Insert Script
Insert Data Performance
[Chart: insert time, LOG-LOG scale]
1k: 3.14s
10k: 29.5s
100k: 292s
Linear scaling
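The "linear scaling" reading can be checked with quick arithmetic: if time grows in proportion to size, the cost per item stays roughly constant. The sketch below uses only the three timings from the chart; the variable names are made up for illustration.

```python
# Timings copied from the chart: size label on the x-axis -> seconds.
timings = {1000: 3.14, 10000: 29.5, 100000: 292.0}

per_item = []
for count, seconds in sorted(timings.items()):
    cost = seconds / count
    per_item.append(cost)
    print("%6d: %.2f ms per item" % (count, 1000.0 * cost))

# Ratio between the slowest and fastest per-item cost; close to 1 means linear.
spread = max(per_item) / min(per_item)
print("spread: %.3f" % spread)
```

The per-item cost stays near 3 ms across two orders of magnitude, which is what a straight line of slope 1 on a log-log plot indicates.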
from pymongo.code import Code  # PyMongo 1.7; later releases provide it as bson.code.Code

map_fn = Code("""
function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}
""")

reduce_fn = Code("""
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
""")

result = people.map_reduce(map_fn, reduce_fn)
Map/Reduce: Total Weight by Day
>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}
... lots more ...
Map/Reduce: Total Weight by Day (results)
Total Weight by Day Performance
[Chart: MapReduce run time]
1k: 4.29s
10k: 38.8s
100k: 384s
map_fn = Code("""
function () {
    // JavaScript months are 0-based, so this is October 5, 2009
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date", target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};
""")

reduce_fn = Code("""
function (key, values) {
    return values[0];
};
""")

result = people.map_reduce(map_fn, reduce_fn, scope={"bsearch": bsearch})
Map/Reduce: Weight on Day
bsearch = Code("""
function(array, prop, value) {
    var min, max, mid, midval;
    for (min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        // compare via valueOf(): === on two Date objects tests identity, not time
        if (value.valueOf() === midval.valueOf()) {
            break;
        } else if (value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    // index of the last element <= value (-1 if there is none)
    return (midval > value) ? mid - 1 : mid;
};
""")
Map/Reduce: bsearch() function
Weight on Day Performance
[Chart: MapReduce run time]
1k: 1.23s
10k: 10s
100k: 108s
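import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)

for person in people.find():
    dates = [ w['date'] for w in person['weighings'] ]
    # bisect_right gives the index of the first date after target_date,
    # so the most recent weighing on or before it is at pos - 1
    pos = bisect.bisect_right(dates, target_date)
    val = person['weighings'][pos - 1] if pos > 0 else None
Weight on Day (Python Version)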
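The lookup can be tried without a database. The sketch below applies `bisect.bisect_right` to a small, sorted list of weighings and picks the most recent entry on or before the target date. The helper name `latest_on_or_before` and the third sample weighing are made up for illustration.

```python
import bisect
import datetime

def latest_on_or_before(weighings, target_date):
    """Return the most recent weighing dated on or before target_date, or None."""
    dates = [w["date"] for w in weighings]          # must already be sorted
    pos = bisect.bisect_right(dates, target_date)   # index of first date > target
    return weighings[pos - 1] if pos > 0 else None

weighings = [
    {"date": datetime.datetime(2009, 1, 15), "weight": 165.0},
    {"date": datetime.datetime(2009, 2, 12), "weight": 163.2},
    {"date": datetime.datetime(2009, 11, 3), "weight": 161.0},
]
target = datetime.datetime(2009, 10, 5)
recent = latest_on_or_before(weighings, target)
print(recent)
```

Using `bisect_right` rather than `bisect_left` means a weighing taken exactly on the target date is included.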
Map/Reduce Performance
[Chart: Weight on Day run time, MapReduce vs. Python]
             1k       10k      100k
MapReduce    1.23s    10s      108s
Python       0.37s    2.2s     26s
Summary
Resources
www.10gen.com
www.mongodb.org
MongoDB: The Definitive Guide (O’Reilly)
PyMongo: api.mongodb.org/python
END OF SLIDES
Chalkboard is not Comic Sans
This is Chalkboard, not Comic Sans.
This isn’t Chalkboard, it’s Comic Sans.
does it matter, anyway?