
Mongouk talk june_18

Date post: 15-Jan-2015
Upload: skills-matter

Table of Contents

1. Structure:
   1. Markus
   2. Flavio
2. Who are we?
   1. Markus Gattol
   2. Flavio Percoco Premoli
3. Introduction Part 1
   1. What I am going to tell you
4. Integration with other Technologies
5. Frequently Asked Questions
   1. Basics
      1. Are there any Reasons not to use MongoDB?
      2. What are the supported Programming Languages?
      3. What is the Status of Python 3 Support?
      4. What is the difference in the main Building-blocks to RDBMSs?
   2. Administration
      1. Is there a Web GUI? What about a REST Interface/API?
      2. Can I rename a Database?
      3. How do I physically migrate a Database?
         1. Secure Copy .... as in scp
         2. Minimum Downtime
      4. How do I update to a new MongoDB version?
      5. What is the default listening Port and IP?
      6. Is there a Way to do automatic Backups?
      7. What is getSisterDB() good for?
      8. How can I make MongoDB automatically start/restart on Server boot/reboot?
   3. Resource Usage
      1. Why is my Database growing so fast?
      2. What Caching Algorithm does MongoDB use?
      3. Why does MongoDB use so much RAM?
      4. What is the so-called Working Set Size?
      5. How much RAM does MongoDB need?
         1. Speed Impact of not having enough RAM
      6. Can I limit MongoDB's RAM Usage?
      7. What can I do about Out Of Memory Errors?
         1. OpenVZ
      8. Does MongoDB use more than one CPU Core?
      9. How can I tell how many clients are connected?
      10. How many parallel Client Connections to MongoDB can there be?
      11. Does MongoDB do Connection Pooling?
      12. Is there a Size limit of how much Data can be stored inside MongoDB?
      13. Do embedded Documents count toward the 4 MiB BSON Document Size Limit?
      14. Does Document Size impact read/write Performance?
      15. Is there a Way to tell the Size of a specific Document?
      16. How can I tell the Size of a Collection and its Indexes?
   4. Collections / Namespaces
      1. What is a Capped Collection? Why use it?
      2. Can I rename a Collection?
      3. What is a Virtual Collection? Why use it?
      4. Can I use a larger Number of Collections/Namespaces?
      5. How about cloning a Collection?
      6. Can I merge two or more Collections into one?
      7. How can I get a list of Collections in my Database?
      8. How do I delete a Collection?
      9. What is a Namespace with regards to MongoDB?
      10. How can I get a list of Namespaces in Database?
   5. Statistics / Monitoring
      1. The Server Status, what does it tell?
   6. Schema / Configuration
   7. Indexes / Search / Metadata
   8. Map / Reduce
   9. GridFS / Data Size
      1. What is GridFS?
         1. What can we do with GridFS
      2. Why use GridFS over ordinary Filesystem Storage?
   10. Scalability / Fault Tolerance / Load Balancing
   11. Miscellaneous
6. Use Case
7. Summary Part 1
8. Introduction Part 2
9. Existing Technologies
10. SQL to MongoDB Query Translation
11. Keeping things lazy...
12. Keeping Relations or Embedding?
    1. Using References:
    2. Without references:
    3. Light and fast (For registered users):
    4. Heavy and slow (For any user):
    5. Lazy relations or mongodb like ones:
13. Taking Advantage from schema-less Databases for Web Development
14. Summary Part 2

Structure:

Markus

• 2min: tell the audience what I am going to tell them (a summary) and why I think it's worth mentioning
• 3min: I'll start with a big picture view (how MongoDB just integrates nicely with existing setups, e.g. folks can continue using dm-crypt/luks), basic principles like
• 5min: pick a few FAQ items and elaborate on them, e.g. "Why is MongoDB using so much RAM"
• 5min: I will then go on taking a use case as an example (a web application built with Django and MongoDB) from the financial domain where we need transactions/locking/ACID and talk about the differences to e.g. MySQL/PostgreSQL
• 5min: also, with this use case, other things like storing numbers of various precision
• 5min: summarize what I've told them

You start after me and drill down on details (the stuff you mentioned in your email ~9 days ago) or whatever you/we see fit.

Flavio

• 2min: I'll tell the audience the topics I'll talk about and how they help us with MongoDB and Django integration
• 5min: Mappers & Stack, I'll list some of the current ODMs used to integrate MongoDB and Django and how django-mongodb-engine integrates with Django and MongoDB.
• 5min: I'll talk about queries, what we have in SQL that we don't have in MongoDB and how we can obtain the same results using it
  ◦ perfect, nothing to add/change here
• 3min: I'll talk about embedding and referencing, when each is worth doing and why
• 5min: I'll talk about how it is possible to take advantage of schemaless databases in web programming (Django oriented)
  ◦ ok sounds good, not sure I understand exactly; approach me today on #sunoano and give me an example
• 5min: Summarize and maybe some benchmark!!!


Who are we?

Still, with all the technology we have these days, at the end of the day it is all about the people ...

/me definitely not a


Markus Gattol

• grew up in Carinthia (southernmost Austrian state, bordering Italy), lives in the UK now
  ◦ http://sunoano.name/albums/places/austria/index.html
• technical background, MSc (Computer Science, Electrical Engineering)
• with Linux (Debian) since 1995, Contributor
• RDBMSs, the usual ...
• Open Source Developer/Contributor in general
• website http://sunoano.name
  ◦ http://sunoano.name/ws/mongodb.html
• works for Heart Internet Ltd., NSN before that
  ◦ http://www.heartinternet.co.uk


Flavio Percoco Premoli

• GNOME a11y Contributor (MouseTrap [http://live.gnome.org/MouseTrap])
• Open Source Developer/Contributor (Web and Desktop)
• R&D Developer at The Net Planet Europe
  ◦ NoSQL Technologies
  ◦ Cloud Computing
  ◦ Knowledge Management Systems
• Linux Lover/User and Mac user too
• website: http://www.flaper87.org
• Twitter: FlaPer87
• Github: FlaPer87
• Bitbucket: FlaPer87
• Everywhere else: FlaPer87


Introduction Part 1

The why ...

1. why are you here today?
2. why does some business want to know about new technology?
3. why are we looking to move away from RDBMSs to NoSQL DBMSs?
4. (from German) Hardware and software are good when they can be understood while being used, and not when one could perhaps fly to Mars with them.

Part 1 is mainly about MongoDB itself and not about Django/Python .... Part 2? .... Django!

What I am going to tell you

Best listener experience possible ...

Introduction Part 1 ... Tell the audience what you're going to tell them

Tell them

• Integration with other Technologies
• Frequently Asked Questions
• Use Case

Summary Part 1 ... Tell the audience what you told them


Integration with other Technologies

• How can I get MongoDB?
• Ok, have it! Now what?

1. full-disk encryption / filesystem-level encryption
2. backup technologies, Rsync/Unison, Bacula, Amanda
3. LVM
4. VPN, SSH
5. Virtualization, OpenVZ


Frequently Asked Questions

Well, just because ...


Basics

Before we start running we need to be able to walk ...


Are there any Reasons not to use MongoDB?

1. We need transactions (ACID (Atomicity, Consistency, Isolation, Durability)).
2. Our data is very relational.
3. Related to 2, we want to be able to do joins on the server (but cannot do embedded objects / arrays).
4. We need triggers on our tables. There might be triggers available soon however.
5. We rely on triggers (or similar functionality) for cascading updates or deletes.
6. We need the database to enforce referential integrity (MongoDB has no notion of this at all).
7. If we need 100% per node durability.
8. Write ahead log. MongoDB does not have one simply because it does not need one.
9. Dynamic aggregation with ad-hoc queries; Crystal reports, reporting, business logic, ... RDBMSs heartland ...


What are the supported Programming Languages?

Right now (June 2010) we can use MongoDB from at least C, C++, C#, .NET, ColdFusion, Erlang, Factor, Java, Javascript, PHP, Python, Ruby, Perl. Of course, there might be more languages available in the future.


What is the Status of Python 3 Support?

The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense. MongoDB can probably support it a bit earlier than Django does, but that is certainly not something the MongoDB community wants to rush and then have to support two totally different code bases.


What is the difference in the main Building-blocks to RDBMSs?

We have RDBMSs such as MySQL, Oracle and PostgreSQL, and then there are NoSQL DBMSs such as MongoDB. Below is a breakdown of how the main building blocks of each correspond:

MySQL, PostgreSQL, Oracle
--------------------------------------------
Server:Port
  - Database
    - Table
      - Row

MongoDB
--------------------------------------------
Server:Port
  - Database
    - Collection
      - Document
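To make the correspondence a little more tangible, here is a minimal Python sketch that uses plain tuples and dictionaries instead of a real database driver. The people/address data and every name in it are made up for illustration; the point is only that a row follows one fixed column list while a document carries its own field names and can embed arrays.

```python
# Illustration only: plain Python structures, no database driver involved.
# All names and values below are hypothetical.

# RDBMS view: a table is a set of rows sharing one fixed list of columns.
people_columns = ("id", "name", "city")
people_rows = [
    (1, "Alice", "London"),
    (2, "Bob", "Vienna"),
]

# MongoDB view: a collection is a set of documents; each document carries
# its own field names and may embed arrays or sub-documents.
people_collection = [
    {"_id": 1, "name": "Alice", "addresses": [{"city": "London"}]},
    {"_id": 2, "name": "Bob", "addresses": [{"city": "Vienna"}, {"city": "Rome"}]},
]

# Correspondence of the main building blocks, as listed above.
BUILDING_BLOCKS = {"Database": "Database", "Table": "Collection", "Row": "Document"}


def cities_of(document):
    """Return all cities embedded in one document."""
    return [address["city"] for address in document.get("addresses", [])]
```

In the relational layout, Bob's second city would need either a second row or a separate table; in the document layout it is simply one more array element.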


Administration

The usual handicraft work ... get and keep it running ... if in doubt, automate!


Is there a Web GUI? What about a REST Interface/API?

• Assuming a mongod process is running on localhost, we can access some statistics at http://localhost:28017/ and http://localhost:28017/_status
• In order to have a REST interface to MongoDB, as CouchDB has, we have to start mongod with the --rest switch.
  ◦ Note however that this is just a read-only REST interface.
• For a read and/or write REST interface:
  ◦ http://www.mongodb.org/display/DOCS/Http+Interface
  ◦ http://github.com/kchodorow/sleepy.mongoose
  ◦ http://github.com/tdegrunt/mongodb-rest
• If we wanted real-time updates from the CLI, then we could also use mongostat.
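As a rough sketch of how the status page could be polled from a script, the following uses only the Python standard library. The helper names are hypothetical, and it is an assumption (not confirmed above) that the _status page answers with a JSON body.

```python
import json
from urllib.request import urlopen


def status_url(host="localhost", http_port=28017):
    """Build the URL of the status page mentioned above."""
    return f"http://{host}:{http_port}/_status"


def fetch_status(host="localhost", http_port=28017, timeout=2.0):
    """Fetch and decode the status page.

    Assumption: the endpoint answers with a JSON body; if it does not,
    json.loads() raises ValueError and the caller has to handle that.
    """
    with urlopen(status_url(host, http_port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_status() only makes sense while a mongod with its HTTP interface enabled is running on the given host; otherwise urlopen raises an error.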


Can I rename a Database?

Yes, but it is not as easy as renaming a collection. As of now, the recommended way to rename a database is to clone it and thereby rename it. This will require enough additional free disk space to fit the current/old database at least twice.


How do I physically migrate a Database?

There is even a clone command for that. Note however that neither copyDatabase() nor cloneDatabase() actually performs a point-in-time snapshot of the entire database -- what they basically do is query the source database and then replicate to the target database. In other words, if we use copyDatabase() or cloneDatabase() on a source database which is online and has operations performed on it, then the target database cannot be a point-in-time snapshot of the exact moment when the command was issued. Rather, at some point in time, source and target will/might have the same data/state.

Secure Copy .... as in scp

A bit of downtime, but with the chance to resume a canceled transfer ....

• shut down mongod on the old machine
• copy/sync the database directory to the new machine
• start mongod on the new machine with dbpath set appropriately
  ◦ http://sunoano.name/ws/debian_notes_cheat_sheets.html#resume_an_scp_transfer

Minimum Downtime

Below is what we could do in order to have as little downtime as possible:

• stop and re-start the existing mongod as master (if it is not already running as master, that is)
• install mongod on the new machine and configure it as slave using --slave and --source
• wait while the slave copies the database, re-indexes and then catches up with its master (this happens automatically when we point a slave to its master). Once the slave has caught up, we
• disable writes to the master (clients can still read/query)
• once all outstanding writes have been committed on the master and the slave has caught up, we shut down the master and restart the slave as new master. The old master can now be removed entirely.
• now we point all traffic at the new master
• finally we enable writes on the new master again ... Et voilà!

Of course, we might also use OpenVZ and its live-migration feature ...


How do I update to a new MongoDB version?

If it is a drop-in replacement, we just need to shut down the older version and start the new one with the appropriate dbpath. Otherwise, i.e. if it is not a drop-in replacement, we would use mongoexport followed by mongoimport.


What is the default listening Port and IP?

We can use netstat to find out:

wks:/home/sa# netstat -tulpena | grep mongo
tcp   0   0 0.0.0.0:27017   0.0.0.0:*   LISTEN   124   1474236   8822/mongod
tcp   0   0 0.0.0.0:28017   0.0.0.0:*   LISTEN   124   1474237   8822/mongod
wks:/home/sa#

The default listening port for mongod is 27017. 28017 is where we can point our web browser in order to get some statistics. By default mongod listens on all local IP addresses (0.0.0.0), i.e. on every interface of the local machine.

And yes, this includes the loopback address/network 127.0.0.0/8 as well as any address the machine has in the private class A network 10.0.0.0/8, the private class B network 172.16.0.0/12 and of course also the private class C network 192.168.0.0/16, amongst others.

Both listening port and IP address can be changed, either by using the CLI switches --port and --bind_ip or through the configuration file, which we can figure out by looking at the runtime configuration.
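When netstat is not at hand, a script can probe the ports instead. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and a successful connection only shows that something accepts TCP connections on that port, not that it is mongod.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the defaults named above one would call is_listening("127.0.0.1", 27017) and is_listening("127.0.0.1", 28017).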


Is there a Way to do automatic Backups?

Yes, http://github.com/micahwedemeyer/automongobackup


What is getSisterDB() good for?

We can use it to get ourselves references to databases which not just saves a lot of typing but is, once we got used to using it, a lot more intuitive:

 1 sa@wks:~/mm/new$ mongo
 2 MongoDB shell version: 1.5.2-pre-
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db.getCollectionNames();
 7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 8 > reference_to_test_db = db.getSisterDB('test');
 9 test
10 > reference_to_test_db.getCollectionNames();
11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
12 > use admin
13 switched to db admin
14 > reference_to_test_db.getCollectionNames();
15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
16 > bye
17 sa@wks:~/mm/new$

Note how we get a reference to our test database in line 8 and how it is used in line 10 and even in line 14, after switching from our test database to the admin database. getCollectionNames() has just been chosen as an example, it could have been any other command as well of course.


How can I make MongoDB automatically start/restart on Server boot/reboot?

One way would be to use the @reboot directive with Cron. However, .deb and .rpm packages install init scripts (sysv or upstart style, as appropriate) on Debian, Ubuntu, Fedora, and CentOS already so MongoDB will restart there without further need from us to do anything special.

• For other constellations, http://gist.github.com/409301 is an init.d script for Unix-like systems based on http://bitbucket.org/bwmcadams/toybox/src/3e84be941408/mongodb.init.rhel.
• For Mac OS X, people have reported that launchctl configurations like http://github.com/AndreiRailean/MongoDB-OSX-Launchctl/blob/master/org.mongo.mongod.plist work.
• For Windows, we have http://www.mongodb.org/display/DOCS/Windows+Service documentation.


Resource Usage

Lots of confusion amongst beginners ...


Why is my Database growing so fast?

The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64 MiB, dbname.1 128 MiB, ... up to 2 GiB. Once the files reach 2 GiB in size, each successive file is also 2 GiB.

So, if we have, say, database files up to dbname.n, then dbname.n-1 might be 90% unused but dbname.n has already been allocated once we start using dbname.n-1. The reasoning here is simple: we do not want to wait for new database files when we need them, so we always allocate the next one in the background as soon as we start to use an empty one.
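The allocation scheme just described can be turned into a small calculation. This is a sketch under the stated numbers only (64 MiB first file, doubling, 2 GiB cap, one file preallocated ahead); the function names are hypothetical and any files other than the numbered data files are ignored.

```python
MIB = 1024 * 1024
FIRST_FILE_BYTES = 64 * MIB        # size of dbname.0
MAX_FILE_BYTES = 2 * 1024 * MIB    # 2 GiB cap per file


def data_file_size(n):
    """Size in bytes of dbname.n under the doubling scheme."""
    return min(FIRST_FILE_BYTES * 2 ** n, MAX_FILE_BYTES)


def allocated_bytes(files_in_use):
    """Disk space allocated once `files_in_use` files hold data.

    One extra file is counted because the next file is preallocated
    as soon as the newest one is first written to.
    """
    return sum(data_file_size(n) for n in range(files_in_use + 1))
```

For example, a database whose data only just spills into dbname.1 already occupies dbname.0, dbname.1 and the preallocated dbname.2 on disk, which is why a small data set can show a surprisingly large footprint.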

Note that deleting data and/or dropping a collection or index will not release already allocated disk space since it is allocated per database. Disk space will only be released if a database is repaired or the database is dropped altogether. Go to http://www.mongodb.org/display/DOCS/Developer+FAQ#DeveloperFAQ-Whyaremydatafilessolarge%3F for more information.


What Caching Algorithm does MongoDB use?

Actually, that is done by the OS using the LRU (Least Recently Used) caching pattern.
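For readers who have not met the pattern before, here is a toy LRU cache in Python. It only illustrates the eviction rule; it is not how any operating system implements its page cache (real kernels typically use approximations of LRU), and the class and method names are made up for this sketch.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache keyed by page number."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()

    def get(self, page):
        """Return cached data for `page`, or None on a miss."""
        if page not in self._pages:
            return None
        self._pages.move_to_end(page)  # mark as most recently used
        return self._pages[page]

    def put(self, page, data):
        """Cache `data`, evicting the least recently used page if full."""
        self._pages[page] = data
        self._pages.move_to_end(page)
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # drop the oldest entry

    def pages(self):
        """Pages from least to most recently used."""
        return list(self._pages)
```

With a capacity of two, touching page 1 before inserting page 3 means page 2 is the one that gets evicted, since it has gone unused the longest.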


Why does MongoDB use so much RAM?

Well, it does not actually, it is just that most folks do not really understand memory management -- there is more to it than just "is in RAM" or "is not in RAM".

The current default storage engine for MongoDB is called MongoMemMapped_RecStore. It uses memory-mapped files for all disk I/O operations. Using this strategy, the operating system's virtual memory manager is in charge of caching. This has several implications:

• There is no redundancy between file system cache and database cache, actually, they are one and the same.
• MongoDB can use all free memory on the server for cache space automatically without any configuration of a cache size.
• Virtual memory size and RSS (Resident Set Size) will appear to be very large for the mongod process. This is benign however -- virtual memory space will be just larger than the size of the datafiles open and mapped, while resident size will vary depending on the amount of memory not used by other processes on the machine.
• Caching behavior (such as LRU'ing out of pages, and laziness of page writes) is controlled by the operating system. The quality of the VMM (Virtual Memory Manager) implementation will vary by OS.

As of now, an alternative storage engine (CachedBasicRecStore), which does not use memory-mapped files, is under development. This engine is more traditional in design with its own page cache. With this store the database has more control over the exact timing of reads and writes, and of the cache LRU strategy.

Generally, the memory-mapped store (MongoMemMapped_RecStore) works quite well. The alternative store will be useful in cases where an operating system's VMM behaves suboptimally.
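What "memory-mapped file" means in practice can be shown with a few lines of standard-library Python. This is a generic illustration of the technique, not MongoDB's storage code; the file name and function name are invented for the sketch.

```python
import mmap
import os
import tempfile


def mapped_roundtrip(payload=b"hello mongod"):
    """Write bytes through a memory mapping and read them back from disk."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "datafile.0")  # hypothetical file name
        # A mapping cannot grow a file, so give it its final size first.
        with open(path, "wb") as handle:
            handle.write(b"\0" * len(payload))
        with open(path, "r+b") as handle:
            with mmap.mmap(handle.fileno(), len(payload)) as view:
                view[:] = payload  # looks like an ordinary memory write ...
                view.flush()       # ... the OS writes the pages back to disk
        with open(path, "rb") as handle:
            return handle.read()
```

The program never issues an explicit write for the payload; it changes memory and leaves the paging to the operating system, which is the division of labour described above.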


What is the so-called Working Set Size?

Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS, relational or non-relational) to access in a period of time.

For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are in the rare case where all the data we store is accessed at the same rate at all times (LRU), then our working set size can be defined as our entire data set stored in MongoDB.


How much RAM does MongoDB need?

We now know MongoDB's caching pattern, we also know what a working set size is. Therefore we can have the following rule of thumb on how much RAM a machine needs in order to work properly.

It is the working set size plus MongoDB's indexes which should reside in RAM at all times i.e. the amount of available RAM should be at least the working set size plus the size of indexes plus what the rest of the OS and other software running on the same machine needs.
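The rule of thumb is plain addition, which the following sketch spells out. The function names are hypothetical and the sizes are inputs we would have to measure or estimate ourselves.

```python
GIB = 1024 ** 3


def recommended_ram(working_set, index_size, other_software):
    """Lower bound for RAM in bytes following the rule of thumb above."""
    return working_set + index_size + other_software


def fits_in_ram(installed_ram, working_set, index_size, other_software):
    """True if the machine has at least the recommended amount of RAM."""
    return installed_ram >= recommended_ram(working_set, index_size, other_software)
```

With made-up figures of a 6 GiB working set, 2 GiB of indexes and 1 GiB for everything else, the lower bound is 9 GiB, so an 8 GiB machine would fall short while a 16 GiB machine would be fine.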

Speed Impact of not having enough RAM

Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in trouble as HDDs are slow at that (roughly 100 operations per second per drive).

One solution is to have lots of HDDs (10, 100, ...). Another one is to use SSDs (Solid State Drives) or, even better, add more RAM. Now that being said, the key factor here is random access. If we do sequential access to data bigger than RAM, then that is fine.

So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if the working set fits in RAM entirely.
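A quick back-of-the-envelope calculation shows why random access on spinning disks hurts. The sketch below takes the rough figure of 100 operations per second per drive from above and assumes, for simplicity, that reads are spread evenly across drives and that none are served from RAM; the function name is hypothetical.

```python
HDD_RANDOM_OPS_PER_SECOND = 100  # rough per-drive figure quoted above


def seconds_for_random_reads(reads, drives=1, ops_per_drive=HDD_RANDOM_OPS_PER_SECOND):
    """Estimated wall-clock seconds to serve `reads` random disk reads."""
    return reads / (drives * ops_per_drive)
```

One million random reads come to about 10,000 seconds (close to three hours) on a single drive and still about 1,000 seconds on ten drives, which is why keeping the working set in RAM matters so much.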

However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can speed up inserts if the index keys have certain properties -- if inserts are an issue, then that would help.


Can I limit MongoDB's RAM Usage?

No, it is not designed to do that, it is designed for speed and scalability.

If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for example Cherokee, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.


What can I do about Out Of Memory Errors?

If we are getting something like this Fri May 21 08:29:52 JS Error: out of memory (or something similar) in our logs, then we hit a memory limit.

As we already know, MongoDB takes all RAM it can get i.e. RAM, or more precisely RSS (Resident Set Size), itself part of virtual memory, will appear to be very large for the mongod process.

The important point here is how it is handled by the OS. If the OS just blocks any attempt to get more virtual memory or, even worse, kills the process (e.g. mongod) which tries to get more virtual memory, then we have got a problem. What can be done is to elevate/alter a few settings:

 1 sa@wks:~$ ulimit -a | egrep virtual\|open
 2 open files                      (-n) 1024
 3 virtual memory          (kbytes, -v) unlimited
 4 sa@wks:~$ lsb_release -irc
 5 Distributor ID: Debian
 6 Release:        unstable
 7 Codename:       sid
 8 sa@wks:~$ uname -a
 9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
10 sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel.

The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by default so that is fine already -- this is actually what causes the most problems, so we need to make sure virtual memory is either reasonably high or, even better, set to unlimited as shown above. With regards to allowed open file descriptors -- by default we are limited to 1024 open files which, in some cases, might pose a problem -- simply elevating it might be enough already and make memory errors go away.

Note that we need to run these commands (e.g. ulimit -v unlimited) in the same user context as mongod i.e. we basically want to script them as part of our mongod startup process.
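Since the limits are inherited per process, a startup wrapper could also check what it actually got before launching mongod. The following is only a minimal sketch using Python's standard resource module (Unix only); the function name current_limits is made up for illustration. Note that RLIMIT_AS is reported in bytes, whereas ulimit -v prints kbytes.

```python
import resource


def current_limits():
    """Return the soft/hard limits this process inherited (None means unlimited)."""
    watched = {
        "open_files": resource.RLIMIT_NOFILE,   # what `ulimit -n` shows
        "virtual_memory": resource.RLIMIT_AS,   # what `ulimit -v` shows, but in bytes
    }
    limits = {}
    for label, which in watched.items():
        soft, hard = resource.getrlimit(which)
        limits[label] = {
            "soft": None if soft == resource.RLIM_INFINITY else soft,
            "hard": None if hard == resource.RLIM_INFINITY else hard,
        }
    return limits


if __name__ == "__main__":
    for label, values in current_limits().items():
        print(label, values)
```

Run as the same user that starts mongod, this shows whether the 1024 open files default or a virtual memory ceiling is still in place.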

OpenVZ

If we are running MongoDB with OpenVZ then there are some more settings we might want to tune in order to keep the OOM (Out Of Memory) killer from kicking in, or to avoid hitting the virtual memory ceiling if it is not set to unlimited. Special attention should be paid to the OpenVZ memory settings i.e. they should be set to reflect MongoDB's memory usage.

Does MongoDB use more than one CPU Core?

For write operations MongoDB makes use of one CPU core. For read operations however, which tend to be the majority of operations, MongoDB uses all CPU cores available to it.

In short: one will notice a speed increase going from a single-core CPU to dual-core or even higher e.g. quad-core or maybe even octo-core since the speed increase is roughly proportional to the available CPU cores.

How can I tell how many clients are connected?

We can look at the connections field (current) with the server status:

sa@wks:~$ mongo --quiet
type "help" for help
> db.serverStatus();
{

[skipping a lot of lines ...]

        "connections" : {
                "current" : 2,
                "available" : 19998
        },

[skipping a lot of lines ...]

}
> bye
sa@wks:~$

How many parallel Client Connections to MongoDB can there be?

Have a look at the connections field (available) with the server status.
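As a small illustration, the two connection fields can be combined into one summary. This is a sketch that works on a plain dictionary shaped like the serverStatus output; how the dictionary is fetched (for example through a driver) is left out, and treating current plus available as the overall ceiling is an assumption based on the numbers shown in this talk.

```python
def connection_summary(server_status):
    """Summarise the 'connections' part of a serverStatus-like dictionary."""
    connections = server_status["connections"]
    current = connections["current"]
    available = connections["available"]
    limit = current + available  # assumed overall ceiling
    return {
        "current": current,
        "available": available,
        "limit": limit,
        "used_fraction": current / limit if limit else 0.0,
    }


if __name__ == "__main__":
    status = {"connections": {"current": 2, "available": 19998}}
    print(connection_summary(status))
```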

Does MongoDB do Connection Pooling?

Yes, we can do connection pooling for performance reasons and overall resource usage optimization -- without it things would be a lot slower and more resource intensive. As of now (June 2010) most of the client drivers, e.g. PyMongo, do connection pooling; how exactly it is done varies by driver.

Is there a Size limit of how much Data can be stored inside MongoDB?

4 MiB is the limit on individual documents, but GridFS uses many documents, so there is no limit, technically/practically speaking.

While the above is true for x86-64, it is not entirely true for x86 (32 bit) -- because of how memory mapped files work, there is a limit of 2 GiB per database.

Do embedded Documents count toward the 4 MiB BSON Document Size Limit?

Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 4 MiB in size.

Does Document Size impact read/write Performance?

Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts to slow down MongoDB itself.

Is there a Way to tell the Size of a specific Document?

Yes, one can use Object.bsonsize(db.whatever.findOne()) in the shell like this:

sa@wks:~$ mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
> db.test.save({ name : "katze" });
> Object.bsonsize(db.test.findOne({ name : "katze"}))
38
> bye
sa@wks:~$
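The 38 bytes can be reproduced by hand from the BSON layout: 4 bytes for the document length, one type byte plus a null-terminated key per element, the value itself, and one closing null byte. The sketch below is not a BSON encoder; it only adds up sizes for a small subset of types, and the names ObjectIdStub, bson_size and fits_in_one_document are made up for illustration. Embedded documents and arrays are counted recursively, which is why they count toward the 4 MiB limit.

```python
MAX_DOCUMENT_SIZE = 4 * 1024 * 1024  # the 4 MiB limit quoted in this talk


class ObjectIdStub:
    """Hypothetical stand-in for a 12-byte ObjectId such as an auto-generated _id."""


def _value_size(value):
    """Return the encoded size in bytes of one BSON value (small subset of types)."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return 1
    if value is None:
        return 0
    if isinstance(value, int):       # assumed mapping: int32 if it fits, else int64
        return 4 if -2**31 <= value < 2**31 else 8
    if isinstance(value, float):
        return 8
    if isinstance(value, str):       # int32 length + UTF-8 bytes + trailing null
        return 4 + len(value.encode("utf-8")) + 1
    if isinstance(value, ObjectIdStub):
        return 12
    if isinstance(value, dict):      # embedded document
        return bson_size(value)
    if isinstance(value, (list, tuple)):  # array = document keyed "0", "1", ...
        return bson_size({str(i): item for i, item in enumerate(value)})
    raise TypeError("unsupported type: %r" % type(value))


def bson_size(document):
    """Return the size in bytes a dictionary would occupy as a BSON document."""
    size = 4 + 1  # int32 length prefix + closing null byte
    for key, value in document.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + key + null
        size += _value_size(value)
    return size


def fits_in_one_document(document):
    """True if the document, including everything embedded, stays within the limit."""
    return bson_size(document) <= MAX_DOCUMENT_SIZE
```

For the katze document this gives 5 bytes of framing, 17 bytes for _id and 16 bytes for name, i.e. 38 in total.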

How can I tell the Size of a Collection and its Indexes?

sa@wks:~$ mongo --quiet
type "help" for help
> db.getCollectionNames();
[ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
> db.test.dataSize();
160
> db.test.storageSize();
2304
> db.test.totalIndexSize();
8192
> db.test.totalSize();
10496

We are using the test collection here. dataSize() is self-explanatory. storageSize() includes our data and all the still free but already allocated disk space for this collection. totalIndexSize() is the size in bytes of all the indexes in this collection and totalSize() is all the storage allocated for all data and indexes in this collection. If we need/want a more detailed view we could also have a look at

> db.test.validate();
{
        "ns" : "test.test",
        "result" : "
validate
  firstExtent:2:2b00 ns:test.test
  lastExtent:2:2b00 ns:test.test
  # extents:1
  datasize?:160 nrecords?:4 lastExtentSize:2304
  padding:1
  first extent:
    loc:2:2b00 xnext:null xprev:null
    nsdiag:test.test
    size:2304 firstRecord:2:2be8 lastRecord:2:2c58
  4 objects found, nobj:4
  224 bytes data w/headers
  160 bytes data wout/headers
  deletedList: 0000001000000000000
  deleted: n: 1 size: 1904
  nIndexes:1
    test.test.$_id_ keys:4
",
        "ok" : 1,
        "valid" : true,
        "lastExtentSize" : 2304
}
> bye
sa@wks:~$

Note that while MongoDB generally does a lot of pre-allocation, we can remedy this by starting mongod with --noprealloc and --smallfiles.
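Going back to the four size figures for the test collection, the relationship between them is simple arithmetic. The sketch below just restates it; the function name collection_size_report is made up for illustration, and the difference between storage and data size includes record headers as well as free space.

```python
def collection_size_report(data_size, storage_size, total_index_size):
    """Relate the numbers reported by dataSize(), storageSize() and totalIndexSize()."""
    return {
        # storage allocated for data plus all indexes, i.e. what totalSize() reports
        "total_size": storage_size + total_index_size,
        # allocated for the collection but not holding document data
        # (record headers and still free, pre-allocated space)
        "storage_minus_data": storage_size - data_size,
    }


if __name__ == "__main__":
    print(collection_size_report(160, 2304, 8192))
```

With the values from the session this reproduces totalSize(): 2304 + 8192 = 10496 bytes.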

Collections / Namespaces

Needs to be known, plain and simple ...

What is a Capped Collection? Why use it?

• Size: http://www.mongodb.org/display/DOCS/Capped+Collections
• Time (TTL Collections): http://jira.mongodb.org/browse/SERVER-211

Can I rename a Collection?

Yes. Using help(); from MongoDB's interactive shell we get, amongst others, db.test.renameCollection( newName , <dropTarget> ) which renames the collection. So yes, we could do db.foo.renameCollection('bar'); and have the collection foo renamed to bar. Renaming a collection is an atomic operation, by the way.

What is a Virtual Collection? Why use it?

It refers to the ability to reference embedded documents as if they were a first-class collection of top level documents, querying on them and returning them as stand-alone entities, etc.

Can I use a larger Number of Collections/Namespaces?

There is a limit to how many collections/namespaces we can have within a single MongoDB database. It is ~24000 namespaces per database. This is essentially the number of collections plus the number of indexes.

How about cloning a Collection?

Yes, possible. Have a look at mongoexport and mongoimport.

Can I merge two or more Collections into one?

Yes, we read from all collections we want to merge and use insert() to write the documents into our single target collection. This can be done on the server (using MongoDB's interactive shell) or from a client.
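From a client, the read-then-insert loop could look roughly like this. It is a driver-agnostic sketch: each source is assumed to be any iterable of dictionaries (for example a cursor returned by a driver's find call) and insert is assumed to be whatever callable writes one document into the target. Dropping _id so the target assigns fresh ids is a design choice of this sketch, not something the talk prescribes.

```python
def merge_collections(sources, insert, drop_id=True):
    """Copy every document from each source into the target via insert().

    sources: iterable of iterables yielding dict documents
    insert:  callable that stores one document in the target collection
    drop_id: remove _id so documents from different sources cannot collide
    Returns the number of documents copied.
    """
    copied = 0
    for source in sources:
        for document in source:
            new_document = dict(document)  # do not mutate the source document
            if drop_id:
                new_document.pop("_id", None)
            insert(new_document)
            copied += 1
    return copied


if __name__ == "__main__":
    target = []
    first = [{"_id": 1, "name": "katze"}]
    second = [{"_id": 1, "name": "hund"}]
    print(merge_collections([first, second], target.append), target)
```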

How can I get a list of Collections in my Database?

We can use getCollectionNames() as shown below in lines 8 and 9. Yet another possibility is shown in lines 23 to 28. Of course, since every collection is also a namespace, we can find them alongside the indexes in lines 11 to 21:

 1 sa@wks:~$ mongo
 2 MongoDB shell version: 1.2.4
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db
 7 test
 8 > db.getCollectionNames();
 9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
10 > db.system.namespaces.find();
11 { "name" : "test.system.indexes" }
12 { "name" : "test.fs.files" }
13 { "name" : "test.fs.files.$_id_" }
14 { "name" : "test.fs.files.$filename_1" }
15 { "name" : "test.fs.chunks" }
16 { "name" : "test.fs.chunks.$_id_" }
17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
18 { "name" : "test.things" }
19 { "name" : "test.things.$_id_" }
20 { "name" : "test.mycollection" }
21 { "name" : "test.mycollection.$_id_" }
23 > show collections
24 fs.chunks
25 fs.files
26 mycollection
27 system.indexes
28 things
29 > bye
30 sa@wks:~$

How do I delete a Collection?

Use db.collection.drop(), but beware: there is no undo.

What is a Namespace with regards to MongoDB?

Collections can be organized in namespaces. These are named groups of collections defined using a dot notation. For example, we could define collections blog.posts and blog.authors, both reside under the namespace blog but are two separate collections.

Namespaces can then be used to access these collections using the dot notation e.g. db.blog.posts.find(); will return all documents from the collection blog.posts but nothing from the collection blog.authors.

Namespaces simply provide an organizational mechanism for the user i.e. the collection namespace is flat from the database point of view which means that blog.authors really just is a collection on its own and not some collection authors grouped under some namespace blog. Again, the collection namespace is flat from the database point of view i.e. technically speaking blog.authors is no different than foo or foo.bar.baz -- grouping just helps the humans keep track ...

How can I get a list of Namespaces in Database?

One way to list all namespaces for a particular database would be to enter MongoDB's interactive shell:

sa@wks:~$ mongo
MongoDB shell version: 1.2.4
url: test
connecting to: test
type "help" for help
> db.system.namespaces.find();
{ "name" : "test.system.indexes" }
{ "name" : "test.fs.files" }
{ "name" : "test.fs.files.$_id_" }
{ "name" : "test.fs.files.$filename_1" }
{ "name" : "test.fs.chunks" }
{ "name" : "test.fs.chunks.$_id_" }
{ "name" : "test.fs.chunks.$files_id_1_n_1" }
{ "name" : "test.things" }
{ "name" : "test.things.$_id_" }
{ "name" : "test.mycollection" }
{ "name" : "test.mycollection.$_id_" }
> db.system.namespaces.count();
11
> bye
sa@wks:~$

The system namespace in MongoDB is special since it contains database system information (read: metadata). There are several system collections, for example system.namespaces, which can be used to get information about all the namespaces within a database.
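Because the namespace count is what the ~24000 limit applies to, it can be handy to see how a namespace listing breaks down into collections and indexes. The sketch below works on plain name strings like the ones printed above; treating every name that contains a $ as an index namespace is an assumption read off this listing, and split_namespaces is a made-up helper name.

```python
def split_namespaces(names, database):
    """Split full namespace names into (collections, indexes) for one database."""
    prefix = database + "."
    collections, indexes = [], []
    for name in names:
        if not name.startswith(prefix):
            continue  # belongs to another database
        short = name[len(prefix):]
        # assumption: index namespaces carry a "$" (e.g. fs.files.$_id_)
        (indexes if "$" in short else collections).append(short)
    return collections, indexes


if __name__ == "__main__":
    listing = [
        "test.system.indexes", "test.fs.files", "test.fs.files.$_id_",
        "test.fs.files.$filename_1", "test.fs.chunks", "test.fs.chunks.$_id_",
        "test.fs.chunks.$files_id_1_n_1", "test.things", "test.things.$_id_",
        "test.mycollection", "test.mycollection.$_id_",
    ]
    collections, indexes = split_namespaces(listing, "test")
    print(len(collections), "collections +", len(indexes), "indexes")
```

For the listing above that is 5 collections plus 6 indexes, matching the count of 11.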

Statistics / Monitoring

Because pilots need to know ...

The Server Status, what does it tell?

sa@wks:~$ mongo --quiet
type "help" for help
> db.serverStatus();
{
        "uptime" : 6695,
        "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)",
        "globalLock" : {
                "totalTime" : 6694193239,
                "lockTime" : 45048,
                "ratio" : 0.000006729414343397326
        },
        "mem" : {
                "resident" : 3,
                "virtual" : 138,
                "supported" : true,
                "mapped" : 0
        },

Most of it is obvious, like for example uptime. The globalLock part is interesting. totalTime is the same as uptime but in microseconds. lockTime is the amount of time the global lock has been held i.e. the total time spent waiting for write queries until a lock has been assigned and thus a write could be made.

One may ask what the point is of having both uptime and totalTime. Well, totalTime will roll over faster since it is in microseconds, so at some point they diverge. The rollover is coordinated between totalTime and lockTime.

All mem units are in MiB. resident is what is in physical memory (also known as RAM), virtual is the virtual address space, mapped is the memory mapped space, and supported tells us whether memory info is supported on our platform.
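The ratio field in globalLock is simply lockTime divided by totalTime, and totalTime converted to seconds lands close to uptime. A minimal sketch that checks this against the numbers above (the function names are made up for illustration):

```python
def global_lock_ratio(global_lock):
    """Fraction of totalTime during which the global lock was held."""
    return global_lock["lockTime"] / global_lock["totalTime"]


def microseconds_to_seconds(microseconds):
    """Convert a microsecond counter such as totalTime to seconds."""
    return microseconds / 1_000_000


if __name__ == "__main__":
    lock = {"totalTime": 6694193239, "lockTime": 45048}
    print(global_lock_ratio(lock))
    print(microseconds_to_seconds(lock["totalTime"]))
```

A ratio this small says that writes held the lock for only a tiny fraction of the time the server was up.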

        "connections" : {
                "current" : 2,
                "available" : 19998
        },
        "extra_info" : {
                "note" : "fields vary by platform",
                "heap_usage_bytes" : 146048,
                "page_faults" : 57
        },
        "indexCounters" : {
                "btree" : {
                        "accesses" : 0,
                        "hits" : 0,
                        "misses" : 0,
                        "resets" : 0,
                        "missRatio" : 0
                }
        },
        "backgroundFlushing" : {
                "flushes" : 111,
                "total_ms" : 2,
                "average_ms" : 0.018018018018018018,
                "last_ms" : 0,
                "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)"
        },

connections tells us how many client connections we can open against mongod, more precisely, current tells us how many existing client connections to mongod there are right now and available shows us how many we have got left.

Within the extra_info part we have heap_usage_bytes which is the main memory needed by the database.

        "opcounters" : {
                "insert" : 16513,
                "query" : 1482263,
                "update" : 141594,
                "delete" : 38,
                "getmore" : 246889,
                "command" : 1247316
        },
        "asserts" : {
                "regular" : 0,
                "warning" : 0,
                "msg" : 0,
                "user" : 0,
                "rollovers" : 0
        },
        "ok" : 1
}
> bye
sa@wks:~$

The opcounters part is also pretty interesting. insert, query, update, and delete are self-explanatory but getmore and command are probably not. When we do a query, we get results in batches. The first batch is counted in query, all subsequent ones in getmore. commands are things like count, group, distinct, etc.

And yes, taking those numbers and dividing them by time (delta or total) will give us operations/time e.g. operations per second or operations since mongod got started. In fact, there is a Munin plugin (http://github.com/erh/mongo-munin) which does exactly this.
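The delta variant can be sketched in a few lines: take two opcounters snapshots a known number of seconds apart and divide the differences by the interval. This is only an illustration with made-up snapshot values, not how the Munin plugin is implemented.

```python
def ops_per_second(before, after, seconds):
    """Per-operation rates between two opcounters snapshots taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        op: (after[op] - before[op]) / seconds
        for op in after
        if op in before
    }


if __name__ == "__main__":
    # hypothetical snapshots taken 30 seconds apart
    first = {"insert": 100, "query": 1000}
    second = {"insert": 160, "query": 1300}
    print(ops_per_second(first, second, 30))
```

Dividing a single snapshot by uptime instead gives the average since mongod got started.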

Schema / Configuration

Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_schema_configuration

Indexes / Search / Metadata

Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_indexes_search_metadata

Map / Reduce

Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_map_reduce

GridFS / Data Size

Store tons of data reliably and smartly ...

What is GridFS?

Basically a collection of normal documents. We have two collections, one for metadata (fs.files) and one consisting of chunks of data (fs.chunks).

The GridFS spec provides a mechanism for transparently dividing a large file among multiple documents. This allows us to efficiently store large objects, and in the case of especially large files, such as videos, permits range operations (e.g., fetching only the first n bytes of a file).
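To make the chunking idea concrete, here is a toy model in plain Python. It is not the GridFS implementation: the helper names are made up, the 256 KiB default chunk size is an assumption, and the files_id and n keys are borrowed from the fs.chunks index name (files_id_1_n_1) seen in the namespace listings. It shows how a range read only needs the chunks that overlap the requested bytes.

```python
DEFAULT_CHUNK_SIZE = 256 * 1024  # assumed default, adjust as needed


def split_into_chunks(files_id, data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Divide one file's bytes into chunk documents numbered n = 0, 1, 2, ..."""
    return [
        {"files_id": files_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]


def read_range(chunks, start, length, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return `length` bytes from position `start`, touching only the needed chunks."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    needed = sorted(
        (chunk for chunk in chunks if first <= chunk["n"] <= last),
        key=lambda chunk: chunk["n"],
    )
    blob = b"".join(chunk["data"] for chunk in needed)
    offset = start - first * chunk_size  # position of `start` inside the first chunk
    return blob[offset:offset + length]


if __name__ == "__main__":
    chunks = split_into_chunks("video-1", b"abcdefghij", chunk_size=4)
    print(len(chunks), read_range(chunks, 0, 6, chunk_size=4))
```

In the real thing the metadata document in fs.files would carry the file length and chunk size, so a reader knows which chunk numbers to ask for.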

What can we do with GridFS?

Store ridiculous amounts of data in a smart way.

Why use GridFS over ordinary Filesystem Storage?

If we use the filesystem we would have to handle backup/replication/scaling ourselves. We would also have to come up with some sort of hashing scheme ourselves plus we would need to take care of cleanup/sorting/moving because filesystems do not love lots of small files.

With GridFS, we can use MongoDB's built-in replication/backup/scaling e.g. scale reads by adding more read-only slaves and writes by using sharding. We also get out of the box hashing (read UUID (Universally Unique Identifier)) for stored content plus we do not suffer from filesystem performance degradation because of a myriad of small files.

Also, we can easily access information from random sections of large files, another thing traditional tools working with data right off the filesystem are not good at. Last but not least, we can keep information associated with the file (who has edited it, download count, description, etc.) right with the file itself.

Scalability / Fault Tolerance / Load Balancing

Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_scalability_fault_tolerance_load_balancing

Miscellaneous

Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_miscellaneous

Use Case

This should have been my major part
◦ locking (read transactions)
◦ asynchronous as opposed to synchronous operations
◦ numbers (double precision)

Again, lack of time ... go to http://sunoano.name/ws/mongodb.html

Summary Part 1

Tell them what you told them ... simple as that ...

Introduction Part 2

Before starting with mongodb specific topics it's important to know that we don't dislike relational databases. We know they are good for many things, but we also know that a web application's success is mainly based on its performance and speed, so that's what we're after and that's why we're all here.

Existing Technologies

• MongoKit (Nicolas Clairon):
  ◦ Great for completely unstructured model programming. It has structure validation but I’ve never used it; I prefer to use mongokit on models that may be constantly changing their structure.
• mongoengine (Harry Marr):
  ◦ It allows you to define schemas for documents and query collections using django-like syntax.
• django-mongodb-engine (Alberto Paro and myself):
  ◦ This is a real Django backend based on django-mongodb and mongoengine, adapted to work with django-nonrel and mongodb without changing anything in the code.

SQL to MongoDB Query Translation....

"What matters is who adapts faster to the changing conditions" - Charles Darwin

The first thing we should remember when moving from SQL databases to NoSQL ones is that models were made to model data but models can be modeled too. What I mean is that people tend to adapt database features to their models instead of adapting models to databases. I'll try to mention some of the common questions found on the mailing list:

• Let's start with JOINS. Why JOINS? Because we don’t have those in MongoDB and we might need them, so we have to figure out what the best workaround is. The best thing you can do here is forget about JOINS; you won't have them. We are not talking about highly relational databases, we are talking about non-relational ones, so there can't be joins between 2 collections if there's no relation between them. One of the things we did was remodel the way we stored data. We embedded what could be embedded and did 2 or more queries where embedding was not possible.

• What about ForeignKeys, do we have those? Yes, or kind of. We have DBRef which is a kind of ForeignKey but I personally wouldn't use refs in mongodb. As I said, MongoDB is not about referencing and collection relations, it is about performance based on dynamism.

• If MongoDB barely has references you could guess that many to many is insignificant; instead I would start thinking in terms of dictionaries/maps and lists/arrays.

• And last but not least, if you really need to do a query that joins 2 collections based on a field reference that should handle a many to many relation, then you have map/reduce.
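The "2 or more queries" workaround from the first point can be sketched without any driver. Here posts is the result of a first query and find_users stands for a second query that fetches all referenced users in one go; both the field name user_id and the helper names are assumptions made for this illustration.

```python
def join_in_application(posts, find_users):
    """Attach the matching user document to every post using one extra query.

    posts:      list of dicts, each with a 'user_id' field (assumed field name)
    find_users: callable taking a list of ids and returning user dicts with '_id'
    """
    user_ids = sorted({post["user_id"] for post in posts})  # query 2 input, deduplicated
    users_by_id = {user["_id"]: user for user in find_users(user_ids)}
    return [dict(post, user=users_by_id.get(post["user_id"])) for post in posts]


if __name__ == "__main__":
    users = [{"_id": 1, "nickname": "FlaPer87"}, {"_id": 2, "nickname": "markus"}]
    posts = [{"title": "Hello", "user_id": 2}, {"title": "Again", "user_id": 2}]

    def lookup(ids):
        return [user for user in users if user["_id"] in ids]

    for post in join_in_application(posts, lookup):
        print(post["title"], post["user"]["nickname"])
```

The point of collecting the ids first is that the number of queries stays at two no matter how many posts there are.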

Keeping things lazy...

Yes, because we’re lazy people so we do lazy things ...

It is important when getting ORMs to work with mongodb that we keep things lazy to avoid bottlenecks in our web applications. Mongodb doesn't have many to many relations but it can store lists and dictionaries. For example:

from django.db import models
# ListField comes from the non-relational Django stack; its import is not shown in the talk

class User(models.Model):
    nickname = models.CharField(max_length=255)
    full_name = models.CharField(max_length=255)
    friends = ListField()
    groups = ListField()

In the User model we have 2 ListFields that may cause some slowdowns in our web application. The first one is a list containing ids/names of the user's friends and the second one contains the groups the user is related to. So think of a user who has many friends and is related to many groups (a popular one): that's a lot of data transfer and many instantiations for our code because each object/id in the ListField has to be instantiated. This might sound obvious but trust me, nothing is obvious when doing web programming.
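One way to picture "keeping things lazy" is a list that stores only the raw ids and instantiates an object the first time it is actually accessed. The class below is a plain-Python sketch of that idea, not the ListField implementation; LazyRefList and its loader argument are made-up names.

```python
class LazyRefList:
    """Holds raw ids and resolves each one only when it is first accessed."""

    def __init__(self, ids, loader):
        self._ids = list(ids)
        self._loader = loader      # callable: id -> full object (e.g. one lookup)
        self._cache = {}

    def __len__(self):
        return len(self._ids)      # counting never triggers a load

    def __getitem__(self, index):
        # integer indexes only in this sketch (no slicing)
        ref = self._ids[index]
        if ref not in self._cache:
            self._cache[ref] = self._loader(ref)
        return self._cache[ref]


if __name__ == "__main__":
    def load_user(user_id):
        print("loading", user_id)
        return {"id": user_id}

    friends = LazyRefList([1, 2, 3], load_user)
    print(len(friends))   # no load happens here
    print(friends[1])     # loads only the friend with id 2
```

A popular user with thousands of friends then costs almost nothing until somebody actually looks at one of them.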

Keeping Relations or Embedding?

This is a common question when moving from relational databases to non-rel ones. Should we keep our models related or embed the smallest ones into the biggest ones? The answer is NO, you shouldn't keep them related. For example, a common situation (or one commonly used to show how mongodb works) is a blog engine with posts and comments. Let's see how we could handle comments (not threaded) in our blog engine:

Using References:

class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = models.ForeignKey(User)
    text = models.CharField(max_length=255)

my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user, text=my_text, defaults={})

Without references:

class Post(models.Model):
    # ...
    comments = ListField()

post.comments.append({'user' : user, 'text' : text})
post.save()

The first example is the most used because it is the way we're used to thinking when we write our models, but the second one is the right one when talking about nosql databases because references make things slower.

The bad thing about embedding our comments like that is that we have to worry about our 4 MiB document limit, so if we are really popular on the net and many people come to our blog and comment on our posts, that might be a problem for us. Even so, this is great: we have removed a model from our app so it should be easier to maintain, shouldn't it? But what is user supposed to be? Is it an embedded user object? Is it a ForeignKey? How should we handle users there?

It again depends on how you'd like to do things. For example, it is possible to save the username as it should be shown and then, when the comments are loaded, just show the username. Those wanting to know more about this user can click on the username, which loads the user's personal info. Here are some examples:

Light and fast (For registered users):

post.comments.append({'user' : 'FlaPer87', 'text' : 'My Comment'})
post.save()

Heavy and slow (For any user):

post.comments.append({'user' : {'username' : 'FlaPer87',
                                'email' : '[email protected]',
                                'url' : 'http://blog.flaper87.org'},
                      'text' : 'My Comment'})
post.save()

Lazy relations or mongodb like ones:

# Automatic serialization done in django-mongodb-engine
post.comments.append({'user' : {'_app': model._meta.app_label,
                                '_model': model._meta.module_name,
                                'pk': model.pk,
                                '_type': "django"},
                      'text' : 'My Comment'})
post.save()

Taking Advantage of schema-less Databases for Web Development

One of the things I like most about mongodb is that it is schema-less. People tend to think of schema-less dbs as a mess, which they're not. Schema-less databases do have a structure; the difference between them and schema based ones is that schema-less structures are dynamic. This means that they can be modified at any time and they're not typed. You can think of schema-less dbs as JSON based maps (just like mongodb does).

This kind of structure can be really helpful when doing web programming. In our case it lets us save any kind of data in our collections and have generic structures that change over time. For example, let's try to improve our Comment model (in case we decided to have some relations).

class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = GenericField()
    text = models.CharField(max_length=255)

my_user = "FlaPer87"  # Known User
my_comment, created = Comment.objects.get_or_create(post=my_post,
                                                    user=my_user,
                                                    text=my_text,
                                                    defaults={})

my_user = {'nickname' : 'FlaPer87',
           'full_name' : 'Flavio Percoco Premoli',
           'email' : '[email protected]',
           'url' : 'http://blog.flaper87.org'}  # Anonymous User
my_comment2, created = Comment.objects.get_or_create(post=my_post,
                                                     user=my_user,
                                                     text=my_text,
                                                     defaults={})

Using a GenericField we'll be able to save anything into that attr, and we'll have to do our checks and controls code side. In this case the schema-less collection helped us to get/save the anonymous user's information without having to create a record in our Users table and without forcing the user to register.
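What "checks and controls code side" might look like is sketched below: a small function that accepts either form of the user value (a plain nickname for a known user, a dictionary for an anonymous one) and returns one predictable shape. The function name, the registered flag and the rule that a nickname is required are assumptions made for this illustration.

```python
def normalize_comment_user(user):
    """Turn the free-form 'user' value of a comment into one predictable dict."""
    if isinstance(user, str):
        # known user: only the nickname was stored
        if not user:
            raise ValueError("nickname must not be empty")
        return {"nickname": user, "registered": True}
    if isinstance(user, dict):
        # anonymous user: details were embedded with the comment
        if not user.get("nickname"):
            raise ValueError("anonymous users need at least a nickname")
        return dict(user, registered=False)
    raise TypeError("user must be a nickname or a dict, got %r" % type(user))


if __name__ == "__main__":
    print(normalize_comment_user("FlaPer87"))
    print(normalize_comment_user({"nickname": "FlaPer87",
                                  "url": "http://blog.flaper87.org"}))
```

Doing this in one place keeps the templates and views free of type checks on the stored value.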

Summary Part 2

• Re-model your models
• Be Lazy to be faster
• Forget about relations, they will slow you down
• Remember that dynamism is better than restrictions
