introducing_in_mongodb

Post on 01-Feb-2016


description

Basic knowledge of MongoDB

transcript

Introducing: MongoDB

David J. C. Beach

Sunday, August 1, 2010

David Beach

Software Consultant (past 6 years)

Python since v1.4 (late 90’s)

Design, Algorithms, Data Structures

Sometimes Database stuff

not a “frameworks” guy

Organizer: Front Range Pythoneers


Outline

Part I: Trends in Databases

Part II: Mongo Basic Usage

Part III: Advanced Features

Part I: Trends in Databases

Database Trends

Past: “Relational” (RDBMS)

Data stored in Tables, Rows, Columns

Relationships designated by Primary, Foreign keys

Data is controlled & queried via SQL

WARNING: extreme oversimplification

Trends: Criticisms of RDBMS

Rigid data model

Hard to scale / distribute

Slow (transactions, disk seeks)

SQL not well standardized

Awkward for modern/dynamic languages

Lots of disagreement over this

There are points & counterpoints from both sides

The debate is not over

Not here to deliver a verdict

POINT: This is why we see an explosion of new databases.

Trends: Fragmentation

Relational with ORM (Hibernate, SQLAlchemy)

ODBMS / ORDBMS (push OO-concepts into database)

Key-Value Stores (MemcacheDB, Redis, Cassandra)

Graph (neo4j)

Document Oriented (Mongo, Couch, etc...)

some examples of DB categories

categories are incomplete

some don’t fit neatly into categories

As with so many things in technology, we’re seeing... FRAGMENTATION!

Where Mongo Fits

“The Best Features of Document Databases, Key-Value Stores, and RDBMSes.”

Mongo’s Tagline (taken from website)


What is Mongo

Document-Oriented Database

Produced by 10gen / Implemented in C++

Source Code Available

Runs on Linux, Mac, Windows, Solaris

Database: GNU AGPL v3.0 License

Drivers: Apache License v2.0

Mongo Advantages

json-style documents (dynamic schemas)

flexible indexing (B-Tree)

replication and high-availability (HA)

automatic sharding support (v1.6)*

easy-to-use API

fast queries (auto-tuning planner)

fast insert & deletes (sometimes trade-offs)

* sharding support available as of v1.6 (late July 2010)

many of these taken straight from home page

Mongo Language Bindings

C, C++, Java

Python, Ruby, Perl

PHP, JavaScript

(many more community supported ones)

Mongo Disadvantages

No Relational Model / SQL

No Explicit Transactions / ACID

Limited Query API

You can do a lot more with MapReduce and JavaScript!

Operations are atomic only within a single document. (Generally)

Can mimic relations with foreign IDs, but referential integrity is not enforced.

When to use Mongo

Rich semistructured records (Documents)

Transaction isolation not essential

Humongous amounts of data

Need for extreme speed

You hate schema migrations

My personal take on this...

Caveat: I’ve never used Mongo in Production!

Part II: Mongo Basic Usage

BRIEFLY cover:

- Download, Install, Configure

- connection, creating DB, creating Collection

- CRUD operations (Insert, Query, Update, Delete)

Installing Mongo

Use a 64-bit OS (Linux, Mac, Windows)

Get Binaries: www.mongodb.org

Run “mongod” process

32-bit available; not for production

MongoDB uses memory-mapped files.

A 32-bit build limits the database to about 2 GB!

Installing PyMongo

Download: http://pypi.python.org/pypi/pymongo/1.7

Build with setuptools

(includes C extension for speed)

# python setup.py install

(to compile without extension)

# python setup.py --no_ext install

Mongo Anatomy

(diagram: a Mongo Server holds Databases; a Database holds Collections; a Collection holds Documents)

>>> import pymongo

>>> connection = pymongo.Connection("localhost")

Getting a Connection

Connection required for using Mongo


>>> db = connection.mydatabase

Finding a Database

Databases = logically separate stores

Navigation using properties

Will create DB if not found


>>> blog = db.blog

Using a Collection

Collection is analogous to Table

Contains documents

Will create collection if not found
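A rough sketch of how property-style navigation with create-if-not-found can be built in plain Python. LazyNamespace is a hypothetical class invented for this illustration; it is not PyMongo's implementation and it talks to no server.

```python
class LazyNamespace:
    """Hypothetical stand-in for a connection, database or collection."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def __getattr__(self, child_name):
        # Called only when normal attribute lookup fails, e.g. for `.mydatabase`.
        if child_name.startswith("_") or child_name in ("name", "children"):
            raise AttributeError(child_name)
        if child_name not in self.children:
            # Create the child the first time it is asked for.
            self.children[child_name] = LazyNamespace(self.name + "." + child_name)
        return self.children[child_name]


connection = LazyNamespace("localhost")
db = connection.mydatabase   # created on first access
blog = db.blog               # likewise
print(blog.name)
```

Asking for the same name twice returns the same object, so nothing has to be declared up front.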


>>> entry1 = {"title": "Mongo Tutorial", "body": "Here's a document to insert." }

>>> blog.insert(entry1)

ObjectId('4c3a12eb1d41c82762000001')

Inserting

collection.insert(document) => document_id

document


>>> entry1

{'_id': ObjectId('4c3a12eb1d41c82762000001'), 'body': "Here's a document to insert.", 'title': 'Mongo Tutorial'}

Inserting (contd.)

Documents must have ‘_id’ field

Automatically generated unless assigned

12-byte unique binary value

You can also assign your own ‘_id’; it can be any unique value.

Mongo’s IDs are designed to be unique...

...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.

ID generated by driver. No waiting on DB.
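A minimal sketch of how a driver could build such a 12-byte ID on its own, with no round trip to the database. The layout assumed here (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter) is consistent with the IDs shown on these slides, but it is an assumption and not taken from them. make_object_id and id_timestamp are hypothetical helper names.

```python
import hashlib
import itertools
import os
import socket
import struct
import time
from datetime import datetime, timezone

_counter = itertools.count(1)


def make_object_id(now=None):
    """Build a 12-byte id: timestamp(4) + machine(3) + pid(2) + counter(3)."""
    timestamp = int(time.time() if now is None else now)
    machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]
    pid = os.getpid() & 0xFFFF
    count = next(_counter) & 0xFFFFFF
    return (struct.pack(">I", timestamp) + machine +
            struct.pack(">H", pid) + struct.pack(">I", count)[1:])


def id_timestamp(hex_id):
    """Read the creation time back out of the first 4 bytes (assumed layout)."""
    return datetime.fromtimestamp(int(hex_id[:8], 16), tz=timezone.utc)


new_id = make_object_id()
print(new_id.hex(), len(new_id))
print(id_timestamp("4c3a12eb1d41c82762000001"))
```

Because time, machine, process and counter are all mixed in, two clients can generate IDs at the same moment without coordinating.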


>>> entry2 = {"title": "Another Post", "body": "Mongo is powerful", "author": "David", "tags": ["Mongo", "Power"]}

>>> blog.insert(entry2)

ObjectId('4c3a1a501d41c82762000002')

Inserting (contd.)

Documents may have different properties

Properties may be atomic, lists, dictionaries

another document

>>> blog.ensure_index("author")

>>> blog.ensure_index("tags")

Indexing

May create index on any field

If field is list => index associates all values

index by single value

by multiple values
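To picture what an index over a list field means, here is a small in-memory sketch. It uses a plain dict instead of the B-Tree the server uses, and build_index is a hypothetical helper name, so treat it as an illustration of the idea only.

```python
from collections import defaultdict


def build_index(documents, field):
    """Map each indexed value to the set of _ids of documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        if field not in doc:
            continue
        value = doc[field]
        # A list field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[item].add(doc["_id"])
    return index


documents = [
    {"_id": 1, "title": "Mongo Tutorial"},
    {"_id": 2, "title": "Another Post", "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 3, "title": "Bulk Entry #1", "author": "Robot", "tags": ["bulk", "Power"]},
]
by_author = build_index(documents, "author")
by_tag = build_index(documents, "tags")
print(sorted(by_tag["Power"]))
```

A document with two tags appears under both tag values, which is what lets a query on a single tag find it.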


import random

bulk_entries = [ ]
for i in range(100000):
    entry = {
        "title": "Bulk Entry #%i" % (i+1),
        "body": "What Content!",
        "author": random.choice(["David", "Robot"]),
        "tags": ["bulk", random.choice(["Red", "Blue", "Green"])]
    }
    bulk_entries.append(entry)

Bulk Insert

Let’s produce 100,000 fake posts


>>> blog.insert(bulk_entries)

[ObjectId(...), ObjectId(...), ...]

Bulk Insert (contd.)

collection.insert(list_of_documents)

Inserts 100,000 entries into blog

Returns in 2.11 seconds


>>> blog.remove() # clear everything

>>> blog.insert(bulk_entries, safe=True)

Bulk Insert (contd.)

returns in 7.90 seconds (vs. 2.11 seconds)

driver returns early; DB is still working

...unless you specify “safe=True”
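The timing difference can be pictured with a toy model: the caller hands documents to a background writer and returns immediately, or waits until the writer has caught up. FakeCollection below is a hypothetical in-memory class; it only imitates the behaviour described on the slide and does not show how the driver or server actually implement it.

```python
import queue
import threading
import time


class FakeCollection:
    """Toy collection whose writes are applied by a background thread."""

    def __init__(self, write_delay=0.001):
        self.stored = []
        self._pending = queue.Queue()
        self._write_delay = write_delay
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            doc = self._pending.get()
            time.sleep(self._write_delay)   # pretend the write takes time
            self.stored.append(doc)
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has been applied."""
        self._pending.join()

    def insert(self, docs, safe=False):
        for doc in docs:
            self._pending.put(doc)
        if safe:
            self.flush()   # wait for the writes before returning


fast = FakeCollection()
fast.insert([{"n": i} for i in range(50)])             # returns early
print("right after unsafe insert:", len(fast.stored))  # may be fewer than 50

careful = FakeCollection()
careful.insert([{"n": i} for i in range(50)], safe=True)
print("after safe insert:", len(careful.stored))
```

The first call comes back sooner because it measures only the hand-off, not the work.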


>>> blog.find_one({"title": "Bulk Entry #12253"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #12253'}

Querying

collection.find_one(spec) => document

spec = document of query parameters

presumably, need more entries to effectively test index performance...

returned in 0.04s - extremely fast

No index created for “title”!


>>> blog.find_one({"title": "Bulk Entry #12253", "tags": "Green"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'), u'author': u'Robot', u'body': u'What Content!', u'tags': [u'bulk', u'Green'], u'title': u'Bulk Entry #12253'}

Querying (Specs)

Multiple conditions on document => “AND”

Value for tags is an “ANY” match
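The two matching rules can be written out as a small pure-Python predicate. This is only a sketch of the semantics described on the slide (exact-value specs, no operators); matches is a hypothetical helper, not a PyMongo function.

```python
def matches(doc, spec):
    """True if doc satisfies every condition in spec (conditions are ANDed)."""
    for field, wanted in spec.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # List field: it is enough that ANY element equals the wanted value.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


post = {"title": "Bulk Entry #12253", "author": "Robot", "tags": ["bulk", "Green"]}
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Green"}))
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Red"}))
```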


>>> green_items = [ ]
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)

- or -

>>> green_items = list(blog.find({"tags": "Green"}))

Querying (Multiple)

collection.find(spec) => cursor

new items are fetched in bulk (behind the scenes)


>>> blog.find({"tags": "Green"}).count()

16646

Querying (Counting)

Use the find() method + count()

Returns number of matches found


>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)

Updating

collection.update(spec, document)

updates single document matching spec

“multi=True” => updates all matching docs


>>> blog.remove({"author":"Robot"}, safe=True)

Deleting

use remove(...)

it works like find(...)

Example removed approximately 50% of records.

Took 2.48 seconds

Part III: Advanced Features

Advanced Querying

Regular Expressions

{"tags" : re.compile(r"^(Green|Blue)$")}

Nested Values {“foo.bar.x” : 3}

$where Clause (JavaScript)
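Two details of these query forms are easy to check in plain Python: how anchors interact with alternation in a regular expression, and what a dotted key such as "foo.bar.x" refers to. get_path and MISSING are hypothetical helpers written for this sketch, not part of any driver.

```python
import re

loose = re.compile(r"^Green|Blue$")      # "starts with Green" OR "ends with Blue"
strict = re.compile(r"^(Green|Blue)$")   # exactly "Green" or exactly "Blue"

MISSING = object()   # marker for "no such path"


def get_path(doc, dotted):
    """Follow a dotted key through nested dictionaries."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return MISSING
        current = current[part]
    return current


doc = {"foo": {"bar": {"x": 3}}}
print(bool(loose.search("Greenish")), bool(strict.search("Greenish")))
print(get_path(doc, "foo.bar.x") == 3)
```

Without the parentheses the anchors bind to one alternative each, so values like "Greenish" also match.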


>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})

Advanced Querying

$lt, $gt, $lte, $gte, $ne

$in, $nin, $mod, $all, $size, $exists, $type

$or, $not

$elemMatch
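To make the operator syntax concrete, here is a toy evaluator for a handful of the operators listed above, applied to plain dictionaries with scalar fields. It is a simplified sketch: the handling of missing fields and the subset of operators covered are choices made for this example, and the sample "views" field is invented.

```python
import operator

COMPARISONS = {
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
}


def is_operator_doc(condition):
    """True for conditions such as {"$gt": 3}, False for plain values."""
    return (isinstance(condition, dict) and bool(condition)
            and all(str(key).startswith("$") for key in condition))


def field_matches(doc, field, condition):
    present = field in doc
    if not is_operator_doc(condition):
        return present and doc[field] == condition
    for op, arg in condition.items():
        if op == "$exists":
            ok = present == bool(arg)
        elif op == "$ne":
            ok = not present or doc[field] != arg
        elif op == "$in":
            ok = present and doc[field] in arg
        elif op == "$nin":
            ok = not present or doc[field] not in arg
        elif op in COMPARISONS:
            ok = present and COMPARISONS[op](doc[field], arg)
        else:
            raise ValueError("operator not covered by this sketch: " + op)
        if not ok:
            return False
    return True


def matches(doc, spec):
    for key, condition in spec.items():
        if key == "$or":
            if not any(matches(doc, branch) for branch in condition):
                return False
        elif not field_matches(doc, key, condition):
            return False
    return True


posts = [
    {"author": "David", "views": 120},
    {"author": "Robot", "views": 8},
    {"author": "Robot"},
]
popular = [p for p in posts if matches(p, {"views": {"$gte": 10, "$lt": 1000}})]
either = [p for p in posts
          if matches(p, {"$or": [{"author": "David"}, {"views": {"$lt": 10}}]})]
print(len(popular), len(either))
```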


>>> blog.find().limit(50)                 # find 50 articles
>>> blog.find().sort("title").limit(30)   # 30 titles
>>> blog.find().distinct("author")        # unique author names

Advanced Querying

collection.find(...)

sort("name") - sorting

limit(...) & skip(...) [like LIMIT & OFFSET]

distinct(...) [like SQL’s DISTINCT]

collection.group(...) - like SQL’s GROUP BY

won’t be showing detailed examples of all these...

there are good tutorials online for all of this

let’s move on to something even more interesting
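For readers who think in plain Python, the cursor helpers above correspond roughly to sorting, slicing and de-duplicating a list. The sample posts are invented for this sketch, which runs entirely in memory.

```python
posts = [
    {"title": "Bulk Entry #3", "author": "Robot"},
    {"title": "Bulk Entry #1", "author": "David"},
    {"title": "Bulk Entry #4", "author": "Robot"},
    {"title": "Bulk Entry #2", "author": "David"},
]

skip, limit = 1, 2
by_title = sorted(posts, key=lambda post: post["title"])   # sort("title")
page = by_title[skip:skip + limit]                         # skip(1).limit(2)
authors = sorted({post["author"] for post in posts})       # distinct("author")

print([post["title"] for post in page])
print(authors)
```

The difference in practice is where the work happens: the database does it before anything is sent to the client.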


Map/Reduce

collection.map_reduce(mapper, reducer)

ultimate in querying power

distribute across multiple nodes

Most powerful querying mechanism
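The shape of a map/reduce job can be shown in a few lines of single-process Python: the mapper emits (key, value) pairs, the pairs are grouped by key, and the reducer folds each group into one value. map_reduce, count_tags and add_up are hypothetical names for this sketch. A real engine may also call the reducer several times on partial results, which this version does not do.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Group everything the mapper emits by key, then reduce each group."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            emitted[key].append(value)
    return {key: reducer(key, values) for key, values in emitted.items()}


def count_tags(doc):
    # Emit one (tag, 1) pair per tag on the document.
    for tag in doc.get("tags", []):
        yield tag, 1


def add_up(key, values):
    return sum(values)


posts = [
    {"title": "Another Post", "tags": ["Mongo", "Power"]},
    {"title": "Bulk Entry #1", "tags": ["bulk", "Green"]},
    {"title": "Bulk Entry #2", "tags": ["bulk", "Red"]},
    {"title": "Mongo Tutorial"},
]
tag_counts = map_reduce(posts, count_tags, add_up)
print(tag_counts["bulk"])
```

Because each key's group is reduced independently, the groups can be handed to different nodes.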

Map/Reduce Visualized

Java MapReduce

Mapper map()

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books, Chapter 2, page 20

also see: Map/Reduce : A Visual Explanation

db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() {
    emit(
      { d1: this.Dim1, d2: this.Dim2 },
      { msum: this.measure1, recs: 1, mmin: this.measure1,
        mmax: this.measure2 < 100 ? this.measure2 : 0 }
    );
  },
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);

SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

mySQL / MongoDB

1. Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.

2. Measures must be manually aggregated.

3. Aggregates depending on record counts must wait until finalization.

4. Measures can use procedural logic.

5. Filters have an ORM/ActiveRecord-looking style.

6. Ascending: 1; Descending: -1

7. Aggregate filtering must be applied to the result set, not in the map/reduce.

Revision 4, Created 2010-03-07

Rick Osborne, rickosborne.org

http://rickosborne.org/download/SQL-to-MongoDB.pdf

Map/Reduce Examples

This is me, playing with Map/Reduce

Health Clinic Example

Person registers with the Clinic

Weighs in on the scale

1 year => comes in 100 times


Health Clinic Example

person = {
    "name": "Bob",
    "weighings": [
        {"date": date(2009, 1, 15), "weight": 165.0},
        {"date": date(2009, 2, 12), "weight": 163.2},
        ...
    ]
}

import datetime
import random

N = 1000           # number of people (assumed here; the charts use 1k, 10k, 100k)
all_people = [ ]   # not shown on the slide; assumed to start empty

for i in range(N):
    person = { 'name': 'person%04i' % i }
    weighings = person['weighings'] = [ ]
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({ 'date': date, 'weight': weight })
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)

Map/Reduce Insert Script

Insert Data Performance

(chart, LOG-LOG scale: insert time vs. data size)

Insert: 1k = 3.14s, 10k = 29.5s, 100k = 292s

Linear scaling
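The "linear scaling" reading can be checked with a little arithmetic on the three measured points: on a log-log plot, linear growth shows up as a slope close to 1. The numbers are the ones from the chart; the variable names are made up for this sketch.

```python
import math

sizes = [1000, 10000, 100000]     # 1k, 10k, 100k
seconds = [3.14, 29.5, 292.0]     # insert times from the chart

# Slope of the line through the first and last point in log-log space.
slope = ((math.log10(seconds[-1]) - math.log10(seconds[0])) /
         (math.log10(sizes[-1]) - math.log10(sizes[0])))

# Throughput at each size, in items per second.
rates = [n / s for n, s in zip(sizes, seconds)]

print(round(slope, 2))
print([round(rate) for rate in rates])
```

The slope comes out just under 1 and the throughput stays in the low 300s per second at every size, which is what linear scaling looks like.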


map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")

reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")

result = people.map_reduce(map_fn, reduce_fn)

Map/Reduce Total Weight by Day

>>> for doc in result.find():
...     print doc

{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}

... lots more ...

Map/Reduce Total Weight by Day

Total Weight by Day Performance

(chart: MapReduce time vs. data size)

MapReduce: 1k = 4.29s, 10k = 38.8s, 100k = 384s

map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date", target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")

reduce_fn = Code("""function (key, values) {
    return values[0];
};""")

result = people.map_reduce(map_fn, reduce_fn, scope={"bsearch": bsearch})

Map/Reduce Weight on Day

bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")

Map/Reduce bsearch() function
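For readers more at home in Python, here is a port that mirrors the control flow of the bsearch() helper: it returns the index of the last element whose property is less than or equal to the value, or -1 if every element is larger. The port and its sample data are written for this sketch; the return value for an empty list is a choice made here.

```python
from datetime import datetime


def bsearch(array, prop, value):
    """Index of the last item with item[prop] <= value (array sorted by prop)."""
    lo, hi = 0, len(array) - 1
    mid, midval = None, None
    while lo <= hi:
        mid = lo + (hi - lo) // 2
        midval = array[mid][prop]
        if value == midval:
            break
        elif value > midval:
            lo = mid + 1
        else:
            hi = mid - 1
    if mid is None:
        return -1   # empty input: nothing at or before value
    return mid - 1 if midval > value else mid


weighings = [
    {"date": datetime(2009, 9, 1), "weight": 165.0},
    {"date": datetime(2009, 10, 1), "weight": 163.2},
    {"date": datetime(2009, 11, 1), "weight": 162.5},
]
pos = bsearch(weighings, "date", datetime(2009, 10, 5))
print(pos, weighings[pos]["weight"])
```

For a sorted list with distinct values this agrees with bisect.bisect_right(...) - 1 from the standard library.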


Weight on Day Performance

(chart: MapReduce time vs. data size)

MapReduce: 1k = 1.23s, 10k = 10s, 100k = 108s

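import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)

for person in people.find():
    dates = [ w['date'] for w in person['weighings'] ]
    # index of the last weighing on or before target_date
    pos = bisect.bisect_right(dates, target_date) - 1
    if pos >= 0:
        val = person['weighings'][pos]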

Weight on Day (Python Version)

Map/Reduce Performance

(chart: Weight on Day query time vs. data size, MapReduce vs. Python)

MapReduce: 1k = 1.23s, 10k = 10s, 100k = 108s

Python: 1k = 0.37s, 10k = 2.2s, 100k = 26s
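Dividing the two series gives a quick feel for the gap. The figures are the ones from the chart; the variable names are made up for this sketch.

```python
labels = ["1k", "10k", "100k"]
mapreduce_seconds = [1.23, 10.0, 108.0]
python_seconds = [0.37, 2.2, 26.0]

# How many times longer MapReduce took than the client-side Python loop.
ratios = [m / p for m, p in zip(mapreduce_seconds, python_seconds)]

for label, ratio in zip(labels, ratios):
    print(label, round(ratio, 1))
```

In these measurements the client-side Python loop was roughly three to five times faster at every size.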


Summary


Resources

www.10gen.com

www.mongodb.org

MongoDB: The Definitive Guide (O’Reilly)

PyMongo: api.mongodb.org/python

END OF SLIDES


Chalkboard is not Comic Sans

This is Chalkboard, not Comic Sans.

This isn’t Chalkboard, it’s Comic Sans.

does it matter, anyway?
