
Introducing: MongoDB

David J. C. Beach

Sunday, August 1, 2010

David Beach

Software Consultant (past 6 years)

Python since v1.4 (late 90’s)

Design, Algorithms, Data Structures

Sometimes Database stuff

not a “frameworks” guy

Organizer: Front Range Pythoneers

Outline

Part I: Trends in Databases

Part II: Mongo Basic Usage

Part III: Advanced Features

Part I: Trends in Databases

Database Trends

Past: “Relational” (RDBMS)

Data stored in Tables, Rows, Columns

Relationships designated by Primary, Foreign keys

Data is controlled & queried via SQL

WARNING: extreme oversimplification

Trends: Criticisms of RDBMS

Rigid data model

Hard to scale / distribute

Slow (transactions, disk seeks)

SQL not well standardized

Awkward for modern/dynamic languages

Lots of disagreement over this

There are points & counterpoints from both sides

The debate is not over

Not here to deliver a verdict

POINT: This is why we see an explosion of new databases.

Trends: Fragmentation

Relational with ORM (Hibernate, SQLAlchemy)

ODBMS / ORDBMS (push OO-concepts into database)

Key-Value Stores (MemcacheDB, Redis, Cassandra)

Graph (neo4j)

Document Oriented (Mongo, Couch, etc...)

categories are incomplete

some don’t fit neatly into categories

As with so many things in technology, we’re seeing... FRAGMENTATION!

some examples of DB categories

Where Mongo Fits

“The Best Features of Document Databases, Key-Value Stores, and RDBMSes.”

Mongo’s Tagline (taken from website)

What is Mongo

Document-Oriented Database

Produced by 10gen / Implemented in C++

Source Code Available

Runs on Linux, Mac, Windows, Solaris

Database: GNU AGPL v3.0 License

Drivers: Apache License v2.0

Mongo Advantages

json-style documents (dynamic schemas)

flexible indexing (B-Tree)

replication and high-availability (HA)

automatic sharding support (v1.6)*

easy-to-use API

fast queries (auto-tuning planner)

fast inserts & deletes (sometimes trade-offs)

* sharding support available as of v1.6 (late July 2010)

many of these taken straight from home page

Mongo Language Bindings

C, C++, Java

Python, Ruby, Perl

PHP, JavaScript

(many more community supported ones)

Mongo Disadvantages

No Relational Model / SQL

No Explicit Transactions / ACID

Limited Query API

You can do a lot more with MapReduce and JavaScript!

Operations can only be atomic within a single document. (Generally)

Can mimic with foreign IDs, but referential integrity not enforced.

When to use Mongo

Rich semistructured records (Documents)

Transaction isolation not essential

Humongous amounts of data

Need for extreme speed

You hate schema migrations

My personal take on this...

Caveat: I’ve never used Mongo in Production!

Part II: Mongo Basic Usage

BRIEFLY cover:

- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)

Installing Mongo

Use a 64-bit OS (Linux, Mac, Windows)

Get Binaries: www.mongodb.org

Run “mongod” process

32-bit available; not for production

MongoDB uses memory-mapped files.

A 32-bit build limits the database to about 2 GB!

Installing PyMongo

Download: http://pypi.python.org/pypi/pymongo/1.7

Build with setuptools

(includes C extension for speed)

# python setup.py install

# python setup.py --no_ext install

(to compile without extension)

Mongo Anatomy

Database

Collection

Document

Mongo Server

>>> import pymongo

>>> connection = pymongo.Connection("localhost")

Getting a Connection

Connection required for using Mongo

>>> db = connection.mydatabase

Finding a Database

Databases = logically separate stores

Navigation using properties

Will create DB if not found

>>> blog = db.blog

Using a Collection

Collection is analogous to Table

Contains documents

Will create collection if not found

>>> entry1 = {"title": "Mongo Tutorial", "body": "Here's a document to insert."}

>>> blog.insert(entry1)

ObjectId('4c3a12eb1d41c82762000001')

Inserting

collection.insert(document) => document_id

document

>>> entry1

{'_id': ObjectId('4c3a12eb1d41c82762000001'), 'body': "Here's a document to insert.", 'title': 'Mongo Tutorial'}

Inserting (contd.)

Documents must have ‘_id’ field

Automatically generated unless assigned

12-byte unique binary value

You can also assign your own ‘_id’; it can be any unique value.

Mongo’s IDs are designed to be unique...

...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.

ID generated by driver. No waiting on DB.
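A rough sketch of how a driver can build such an ID without talking to the server. It assumes the ObjectId layout documented for Mongo at the time of this talk (4-byte timestamp, 3-byte machine hash, 2-byte process ID, 3-byte counter). `make_object_id` is a made-up helper for illustration, not PyMongo's code.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Hypothetical helper: mimics the 12-byte layout, not PyMongo's implementation.
# The counter starts at a random value and increases by one per generated ID.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))


def make_object_id():
    """Build a 12-byte ID locally: timestamp + machine + pid + counter."""
    timestamp = struct.pack(">I", int(time.time()) & 0xFFFFFFFF)          # 4 bytes
    machine = hashlib.sha256(socket.gethostname().encode()).digest()[:3]  # 3 bytes
    pid = struct.pack(">H", os.getpid() % 0xFFFF)                         # 2 bytes
    counter = (next(_counter) % 0x1000000).to_bytes(3, "big")             # 3 bytes
    return timestamp + machine + pid + counter


ids = [make_object_id() for _ in range(1000)]
print(ids[0].hex())  # 24 hex characters, like the ObjectId values shown above
```

Because every piece comes from the client (clock, host, process, counter), nothing here has to wait on the database.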

>>> entry2 = {"title": "Another Post", "body": "Mongo is powerful", "author": "David", "tags": ["Mongo", "Power"]}

>>> blog.insert(entry2)

ObjectId('4c3a1a501d41c82762000002')

Inserting (contd.)

Documents may have different properties

Properties may be atomic, lists, dictionaries

another document

>>> blog.ensure_index("author")

>>> blog.ensure_index("tags")

Indexing

May create index on any field

If field is list => index associates all values

index by single value

by multiple values
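One way to picture "index associates all values": each element of a list field becomes its own index entry pointing back at the document. The sketch below uses a plain dict instead of Mongo's B-Tree, and `build_index` is a made-up name for illustration.

```python
from collections import defaultdict


def build_index(documents, field):
    """Map each indexed value to the set of _ids of documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        if field not in doc:
            continue
        value = doc[field]
        # A list field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc["_id"])
    return index


docs = [
    {"_id": 1, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 2, "author": "Robot", "tags": ["bulk", "Green"]},
    {"_id": 3, "author": "David", "tags": ["bulk", "Red"]},
]
by_author = build_index(docs, "author")   # index by single value
by_tag = build_index(docs, "tags")        # index by multiple values
print(sorted(by_author["David"]))  # [1, 3]
print(sorted(by_tag["bulk"]))      # [2, 3]
```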

import random

bulk_entries = []
for i in range(100000):
    entry = {
        "title": "Bulk Entry #%i" % (i + 1),
        "body": "What Content!",
        "author": random.choice(["David", "Robot"]),
        "tags": ["bulk", random.choice(["Red", "Blue", "Green"])],
    }
    bulk_entries.append(entry)

Bulk Insert

Let’s produce 100,000 fake posts

>>> blog.insert(bulk_entries)

[ObjectId(...), ObjectId(...), ...]

Bulk Insert (contd.)

collection.insert(list_of_documents)

Inserts 100,000 entries into blog

Returns in 2.11 seconds

>>> blog.remove() # clear everything

>>> blog.insert(bulk_entries, safe=True)

Bulk Insert (contd.)

returns in 7.90 seconds (vs. 2.11 seconds)

driver returns early; DB is still working

...unless you specify “safe=True”

>>> blog.find_one({"title": "Bulk Entry #12253"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #99999'}

Querying

collection.find_one(spec) => document

spec = document of query parameters

presumably, need more entries to effectively test index performance...

returned in 0.04s - extremely fast

No index created for “title”!

>>> blog.find_one({"title": "Bulk Entry #12253", "tags": "Green"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #99999'}

Querying (Specs)

Multiple conditions on document => “AND”

Value for tags is an “ANY” match
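The two rules above can be mimicked in a few lines of plain Python. This is a simplified model of spec matching, not how the server implements it. `matches` is a hypothetical helper and only handles equality.

```python
def matches(document, spec):
    """Return True if the document satisfies every condition in the spec (AND)."""
    for field, wanted in spec.items():
        if field not in document:
            return False
        actual = document[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # List field: ANY element equal to the wanted value is a match.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


post = {"title": "Bulk Entry #12253", "author": "Robot", "tags": ["bulk", "Green"]}
print(matches(post, {"tags": "Green"}))                                # True
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Green"}))  # True
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Red"}))    # False
```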

>>> green_items = []
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)

Querying (Multiple)

collection.find(spec) => cursor

new items are fetched in bulk (behind the scenes)

>>> green_items = list(blog.find({"tags": "Green"}))

- or -

>>> blog.find({"tags": "Green"}).count()

Querying (Counting)

Use the find() method + count()

Returns number of matches found

>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)

Updating

collection.update(spec, document)

updates single document matching spec

“multi=True” => updates all matching docs

>>> blog.remove({"author":"Robot"}, safe=True)

Deleting

use remove(...)

it works like find(...)

Example removed approximately 50% of records.

Took 2.48 seconds

Part III: Advanced Features

Advanced Querying

Regular Expressions

{"tags": re.compile(r"^(Green|Blue)$")}

Nested Values {"foo.bar.x": 3}

$where Clause (JavaScript)
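Two of these forms are easy to try in plain Python. Anchors and alternation interact in a way that is easy to miss: without parentheses, `^Green|Blue$` means "starts with Green" or "ends with Blue". The dotted-key walk is only a model of how a nested key is resolved. `get_path` is a hypothetical helper.

```python
import re

loose = re.compile(r"^Green|Blue$")      # "starts with Green" OR "ends with Blue"
strict = re.compile(r"^(Green|Blue)$")   # exactly "Green" or exactly "Blue"

print(bool(loose.search("Greenish")))    # True
print(bool(strict.search("Greenish")))   # False
print(bool(strict.search("Blue")))       # True


def get_path(document, dotted_key):
    """Follow a dotted key such as "foo.bar.x" through nested dicts."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


print(get_path({"foo": {"bar": {"x": 3}}}, "foo.bar.x"))  # 3
```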

>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})

Advanced Querying

$lt, $gt, $lte, $gte, $ne

$in, $nin, $mod, $all, $size, $exists, $type

$or, $not

$elemMatch
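To get a feel for how operator documents read, here is a toy evaluator for a handful of them. It is a simplification (a missing field never matches here, for example), the `views` field is invented for the example, and none of this is the server's code.

```python
import operator

# Simplified semantics for a few operators; an illustration only.
OPERATORS = {
    "$lt": operator.lt,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$gte": operator.ge,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def field_matches(value, condition):
    """Check one field value against a plain value or an operator document."""
    if isinstance(condition, dict):
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    return value == condition


def matches(document, spec):
    """All fields in the spec must match (AND)."""
    return all(
        field in document and field_matches(document[field], cond)
        for field, cond in spec.items()
    )


doc = {"author": "David", "views": 120}  # "views" is a made-up field
print(matches(doc, {"views": {"$gt": 100, "$lte": 120}}))        # True
print(matches(doc, {"author": {"$in": ["Robot", "Rick"]}}))      # False
print(matches(doc, {"author": {"$ne": "Robot"}, "views": 120}))  # True
```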

>>> blog.find().limit(50)                  # find 50 articles
>>> blog.find().sort("title").limit(30)    # 30 titles
>>> blog.find().distinct("author")         # unique author names

Advanced Querying

collection.find(...)

sort(“name”) - sorting

limit(...) & skip(...) [like LIMIT & OFFSET]

distinct(...) [like SQL’s DISTINCT]

collection.group(...) - like SQL’s GROUP BY

won’t be showing detailed examples of all these...

there are good tutorials online for all of this

let’s move on to something even more interesting

Map/Reduce

collection.map_reduce(mapper, reducer)

ultimate in querying power

distribute across multiple nodes

Most powerful querying mechanism
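The shape of the computation is easier to see in miniature. This single-process sketch (no Mongo involved) runs a mapper over every document, groups the emitted values by key, and reduces each group. The function names and the sample posts are made up for illustration.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Run mapper over each document, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_tags(doc):
    # Emit (tag, 1) for every tag on the post.
    for tag in doc.get("tags", []):
        yield tag, 1


def total(key, values):
    return sum(values)


posts = [
    {"title": "a", "tags": ["bulk", "Green"]},
    {"title": "b", "tags": ["bulk", "Red"]},
    {"title": "c", "tags": ["Green"]},
]
print(map_reduce(posts, count_tags, total))
# {'bulk': 2, 'Green': 2, 'Red': 1}
```

In Mongo the reducer may be called more than once for the same key on partial results, so it should return the same kind of value it receives. Summing, as here, satisfies that.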

Map/Reduce Visualized

Java MapReduce

Mapper map()

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}


Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books

Chapter 2, page 20

also see: Map/Reduce : A Visual Explanation

db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() {
    emit(
      { d1: this.Dim1, d2: this.Dim2 },
      { msum: this.measure1, recs: 1, mmin: this.measure1,
        mmax: this.measure2 < 100 ? this.measure2 : 0 }
    );
  },
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);

SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A', 'B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.

Measures must be manually aggregated.

Aggregates depending on record counts must wait until finalization.

Measures can use procedural logic.

Filters have an ORM/ActiveRecord-looking style.

Aggregate filtering must be applied to the result set, not in the map/reduce.

Ascending: 1; Descending: -1

Rick Osborne, rickosborne.org

MySQL / MongoDB

http://rickosborne.org/download/SQL-to-MongoDB.pdf

Map/Reduce Examples

This is me, playing with Map/Reduce

Health Clinic Example

Person registers with the Clinic

Weighs in on the scale

1 year => comes in 100 times

Health Clinic Example

person = { "name": "Bob",
    "weighings": [
        {"date": date(2009, 1, 15), "weight": 165.0},
        {"date": date(2009, 2, 12), "weight": 163.2},
        ... ]
}

import datetime
import random

all_people = []
for i in range(N):  # N = number of people to generate
    person = {'name': 'person%04i' % i}
    weighings = person['weighings'] = []
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({'date': date, 'weight': weight})
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)

Map/Reduce Insert Script

Insert Data Performance

Chart (LOG-LOG scale): Insert at 1k, 10k, 100k

Linear scaling

from pymongo.code import Code  # where Code lives in PyMongo 1.x

map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")

reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")

result = people.map_reduce(map_fn, reduce_fn)

Map/Reduce Total Weight by Day

>>> for doc in result.find(): print doc

{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}

... lots more ...

Map/Reduce Total Weight by Day

Total Weight by Day Performance

Chart: MapReduce at 1k, 10k, 100k

map_fn = Code("""function () {
    // JavaScript months are zero-based, so 9 means October
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date", target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")

reduce_fn = Code("""function (key, values) {
    return values[0];
};""")

result = people.map_reduce(map_fn, reduce_fn, scope={"bsearch": bsearch})

Map/Reduce Weight on Day

bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")

Map/Reduce bsearch() function
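For anyone who wants to poke at the search logic outside the database, here is a Python port of the same idea. It assumes the array is sorted with distinct values, returns the index of the last item at or before the target, and returns -1 when there is none. The third weighing in the sample data is invented.

```python
import bisect
import datetime


def bsearch(array, prop, value):
    """Index of the last item whose `prop` is <= value, or -1 if there is none."""
    low, high = 0, len(array) - 1
    mid, midval = -1, None
    while low <= high:
        mid = low + (high - low) // 2
        midval = array[mid][prop]
        if value == midval:
            break
        elif value > midval:
            low = mid + 1
        else:
            high = mid - 1
    # If the last probe overshot the target, step back one position.
    return mid - 1 if midval is not None and midval > value else mid


weighings = [
    {"date": datetime.date(2009, 1, 15), "weight": 165.0},
    {"date": datetime.date(2009, 2, 12), "weight": 163.2},
    {"date": datetime.date(2009, 11, 3), "weight": 161.9},  # made-up sample
]
target = datetime.date(2009, 10, 5)
pos = bsearch(weighings, "date", target)
print(weighings[pos]["date"])  # 2009-02-12

# The standard library gives the same position: insertion point minus one.
dates = [w["date"] for w in weighings]
print(pos == bisect.bisect_right(dates, target) - 1)  # True
```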

Weight on Day Performance

Chart: MapReduce at 1k, 10k, 100k (one point labeled 1.23s)

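import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)

for person in people.find():
    dates = [w['date'] for w in person['weighings']]
    # bisect_right gives the insertion point; step back one to get the
    # most recent weighing on or before target_date, as bsearch() does
    pos = bisect.bisect_right(dates, target_date) - 1
    val = person['weighings'][pos]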

Weight on Day (Python Version)

Map/Reduce Performance

Chart: MapReduce vs. Python at 1k, 10k, 100k

Summary

Resources

www.10gen.com

www.mongodb.org

MongoDB: The Definitive Guide

O’Reilly

PyMongo: api.mongodb.org/python

END OF SLIDES

Chalkboard is not Comic Sans

This is Chalkboard, not Comic Sans.

This isn’t Chalkboard, it’s Comic Sans.

does it matter, anyway?