
Exercise11 MongoDB Solution

Date post: 19-Dec-2021

Big Data for Engineers – Exercises

Spring 2020 – Week 11 – ETH Zurich

MongoDB

Introduction

This exercise covers document stores. MongoDB was chosen as the representative document store for the practical exercises. Instructions are provided for installing it through the Azure Portal.

1. Document stores

A record in a document store is a document. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs.

The values of fields may include other documents, arrays, and arrays of documents. Collections in MongoDB have a flexible schema: the documents in a collection do not need to share the same set of fields or structure, and a field common to several documents may hold a different type of data in each of them.
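As a small illustration of the field-value structure and the flexible schema described above, the sketch below writes two documents as Python dicts, the form in which pymongo accepts them. All names and values are hypothetical and are not part of the exercise dataset; the list only stands in for a collection.

```python
# Hypothetical documents, written as Python dicts (the form pymongo accepts).
person = {
    "name": {"first": "Ada", "last": "Lovelace"},  # embedded document
    "tags": ["mathematics", "computing"],          # array
    "works": [                                     # array of documents
        {"title": "Notes", "year": 1843},
    ],
}

# A second document in the same collection may use different fields,
# and a shared field ("name") may even hold a different type.
other = {"name": "Anonymous", "year_of_birth": 1900}

collection = [person, other]  # plain list standing in for a MongoDB collection

for doc in collection:
    print(sorted(doc.keys()), type(doc["name"]).__name__)
```

The loop prints the field names of each document together with the type of its "name" field, which differs between the two documents.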

1.1 General Questions

1. What are the advantages of document stores over relational databases?
2. Can the data in document stores be normalized?
3. How does denormalization affect performance?

Solution

1) Flexibility. Not every record needs to store the same properties, and new properties can be added on the fly (flexible schema).

2) Yes. References can be used for data normalization.

3) With denormalization, all data for an object is stored in a single record. In general, this gives better performance for read operations (since expensive joins can be omitted) and makes it possible to request and retrieve related data in a single database operation. In addition, embedded data models make it possible to update related data in a single atomic write operation.
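The difference between the embedded (denormalized) and the referenced (normalized) model can be sketched in plain Python. The order and customer records below are hypothetical, and the `resolve` helper is only an illustration of the extra lookup that a reference requires; it is not a MongoDB API.

```python
# Denormalized (embedded) model: one record holds everything.
embedded_order = {
    "_id": 1,
    "customer": {"name": "Alice", "city": "Zurich"},
    "items": ["keyboard", "mouse"],
}

# Normalized model: the order stores only a reference to the customer.
customers = {100: {"_id": 100, "name": "Alice", "city": "Zurich"}}
referenced_order = {"_id": 1, "customer_id": 100, "items": ["keyboard", "mouse"]}


def resolve(order, customers_by_id):
    """Follow the reference by hand: the extra lookup a join would perform."""
    customer = customers_by_id[order["customer_id"]]
    return {
        "_id": order["_id"],
        "customer": {"name": customer["name"], "city": customer["city"]},
        "items": order["items"],
    }


# Both models carry the same information; only the read path differs.
print(resolve(referenced_order, customers) == embedded_order)
```

Reading the embedded order needs one access, while the referenced order needs a second access to the customer record. In exchange, the customer data is stored only once in the referenced model.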

1.2 True/False Questions

Say if the following statements are true or false.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB encodes documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data.
6. MongoDB performance degrades when the number of documents increases.
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.

Solution

1. (False) Document stores expose only a key-value interface.
2. (True) Different relationships between data can be represented by references and embedded documents.
3. (False) MongoDB does not support schema validation.
4. (False) MongoDB encodes documents in the XML format.
5. (False) In document stores, you must determine and declare a table's schema before inserting data.
6. (True) MongoDB performance degrades when the number of documents increases.
7. (False) Document stores are column stores with flexible schema.
8. (True) There are no joins in MongoDB. Nonetheless, starting in version 3.2, MongoDB supports aggregations with the "$lookup" operator, which can perform a LEFT OUTER JOIN.

2. MongoDB

In this part of the exercise, you will set up a MongoDB image using Azure Container Instances (ACI). With ACI, apps can be deployed without explicitly managing virtual machines. You can learn more about ACI in the Azure documentation.

**Important: please delete your container after finishing the exercise.**

2.1 Install MongoDB

1. Open the Azure portal and click "Create a resource". After searching for container instances, click "Container Instances Microsoft" and "Create".
2. In the "Basics" tab, select your subscription for this exercise, and create a new resource group.
3. Fill in the container name and region. You can select any region you prefer.
4. Select "Docker Hub or other registry" for "Image source", and type mongo in the "Image" field. By default, Azure will use Docker Hub as the container registry. Leave the other settings at their defaults.
5. In the "Networking" tab, choose a DNS name for your container. Open port 27017, which is the default port that MongoDB listens on. Use TCP for the port.
6. Change nothing on the "Advanced" and "Tags" tabs.
7. In the "Review" tab, review your resource settings and click "Create". The deployment should finish within a couple of minutes. In fact, fast startup time is one of the benefits of using ACI!

2.2 Set up a test database

After the container is deployed, we need to connect to the container to create a database user.

1. Select the recently created container resource in the Azure portal, click "Settings - Containers", then choose the "Connect" tab. Use /bin/bash as the start-up command. Click "Connect".
2. Start the MongoDB shell with mongo -shell .
3. Select the admin database:

use admin

4. Then create a root user:

db.createUser( { user: "root", pwd: "root", roles:["root"] })

5. Log out from MongoDB shell:

exit

6. Now we are in the shell of the container. Download an example dataset:

apt update && apt install wget && wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json

7. Import the dataset using mongoimport :

mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json

If the dataset is imported successfully, mongoimport reports the number of documents it imported.

2.3 Connect to the MongoDB server

We have finished setting up the database server. Next, we need to connect to the server using a pymongo client. First, install some packages:

In [1]:

!pip install pymongo==3.10.1
!pip install dnspython

Output of In [1]:

Requirement already satisfied: pymongo==3.10.1 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (3.10.1)
WARNING: You are using pip version 19.3.1; however, version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already satisfied: dnspython in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (1.16.0)
WARNING: You are using pip version 19.3.1; however, version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Import some libraries:

In [2]:

from pymongo import MongoClient, errors
import dns
from pprint import pprint
import urllib
import json
import dateutil
from datetime import datetime, timezone, timedelta

In order to connect to MongoDB, we need to know the domain name of the host. In the resource console, click "Overview" to see the basic information of the container. Copy the host URL from the "FQDN" field and paste it into the following cell. Execute the cell to connect to the database.

In [3]:

# global variables for MongoDB host (default port is 27017)
DOMAIN = 'mymongo.westeurope.azurecontainer.io'  # Note: replace this with the URL of your own container!
PORT = 27017

# use a try-except block to catch connection errors
try:
    # try to instantiate a client instance
    client = MongoClient(
        host = [ str(DOMAIN) + ":" + str(PORT) ],
        serverSelectionTimeoutMS = 3000, # 3 second timeout
        username = "root",
        password = "root",
    )
    # MongoClient connects lazily, so force a round trip to the server here
    client.server_info()
    db = client.test
except errors.ServerSelectionTimeoutError as err:
    # set the client and the database to None if the server cannot be reached
    client = None
    db = None
    # report the pymongo.errors.ServerSelectionTimeoutError
    print("pymongo ERROR:", err)

db.restaurants

Out[3]:

Collection(Database(MongoClient(host=['mymongo.westeurope.azurecontainer.io:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=3000), 'test'), 'restaurants')

As a sanity check, we count the number of documents in the restaurants collection that we previously imported. It should match the number reported by mongoimport.

In [4]:

db.restaurants.count_documents({})

Out[4]:

25359

2.4 MongoDB CRUD operations

In this section, we will go through some commonly used CRUD (Create, Read, Update, Delete) operations in MongoDB.

In [5]:

# Create a new collection
scientists = db['scientists']

In [6]:

# Insert some documents.
# Note that documents can have nested structures, and the collection can be heterogeneous.
scientists.insert_one({
    "Name": { "First": "Albert", "Last": "Einstein" },
    "Theory": "Particle Physics"
})
scientists.insert_one({
    "Name": { "First": "Kurt", "Last": "Gödel" },
    "Theory": "Incompleteness"
})
scientists.insert_one({
    "Name": { "First": "Sheldon", "Last": "Cooper" }
})

Out[6]:

<pymongo.results.InsertOneResult at 0x7fb8165fd048>

In [7]:

# Select all documents from the collection
scientists.find()

Out[7]:

<pymongo.cursor.Cursor at 0x7fb81609f278>

In [8]:

# As you can see, the find() method returns a Cursor object.
# One must iterate over the Cursor object to access individual documents.
for doc in scientists.find():
    pprint(doc)

Output of In [8]:

{'Name': {'First': 'Albert', 'Last': 'Einstein'}, 'Theory': 'Particle Physics', '_id': ObjectId('5eb590648e17271bc85a33cf')}
{'Name': {'First': 'Kurt', 'Last': 'Gödel'}, 'Theory': 'Incompleteness', '_id': ObjectId('5eb590658e17271bc85a33d0')}
{'Name': {'First': 'Sheldon', 'Last': 'Cooper'}, '_id': ObjectId('5eb590658e17271bc85a33d1')}

Query Documents

For the db.collection.find() method, you can specify the following optional fields:

- a query filter to specify which documents to return,
- a query projection to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- optionally, a cursor modifier to impose limits, skips, and sort orders.

In [9]:

# Using a query filter
for doc in db.scientists.find({"Theory": "Particle Physics"}):
    pprint(doc)

Output of In [9]:

{'Name': {'First': 'Albert', 'Last': 'Einstein'}, 'Theory': 'Particle Physics', '_id': ObjectId('5eb590648e17271bc85a33cf')}

In [10]:

# Using a projection
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"Name.Last": 1}):
    pprint(doc)

Output of In [10]:

{'Name': {'Last': 'Einstein'}, '_id': ObjectId('5eb590648e17271bc85a33cf')}

In [11]:

# Using a projection, with "_id" output disabled
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"_id": 0, "Name.Last": 1}):
    pprint(doc)

Output of In [11]:

{'Name': {'Last': 'Einstein'}}

In [12]:

# Insert more documents
doc_list = [
    {"Name":"Einstein", "Profession":"Physicist"},
    {"Name":"Gödel", "Profession":"Mathematician"},
    {"Name":"Ramanujan", "Profession":"Mathematician"},
    {"Name":"Pythagoras", "Profession":"Mathematician"},
    {"Name":"Turing", "Profession":"Computer Scientist"},
    {"Name":"Church", "Profession":"Computer Scientist"},
    {"Name":"Nash", "Profession":"Economist"},
    {"Name":"Euler", "Profession":"Mathematician"},
    {"Name":"Bohm", "Profession":"Physicist"},
    {"Name":"Galileo", "Profession":"Astrophysicist"},
    {"Name":"Lagrange", "Profession":"Mathematician"},
    {"Name":"Gauss", "Profession":"Mathematician"},
    {"Name":"Thales", "Profession":"Mathematician"}
]
scientists.insert_many(doc_list)

Out[12]:

<pymongo.results.InsertManyResult at 0x7fb8165e2d08>

In [13]:

# Using cursor modifiers
print("Using sort:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1):
    pprint(doc)

print("Using skip:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1):
    pprint(doc)

print("Using limit:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1).limit(3):
    pprint(doc)

Output of In [13]:

Using sort:
{'Name': 'Euler'}
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}
{'Name': 'Pythagoras'}
{'Name': 'Ramanujan'}
{'Name': 'Thales'}
Using skip:
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}
{'Name': 'Pythagoras'}
{'Name': 'Ramanujan'}
{'Name': 'Thales'}
Using limit:
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}

In [14]:

# Updating documents

# Adding a new field:
scientists.update_many({"Name": "Einstein"}, {"$set": {"Century" : "20"}})
pprint(scientists.find_one({"Name": "Einstein"}))

# Changing the type of a field:
scientists.update_many({"Name": "Nash"}, {"$set": {"Profession" : ["Mathematician", "Economist"]}})
pprint(scientists.find_one({"Name": "Nash"}))

Output of In [14]:

{'Century': '20', 'Name': 'Einstein', 'Profession': 'Physicist', '_id': ObjectId('5eb590678e17271bc85a33d2')}
{'Name': 'Nash', 'Profession': ['Mathematician', 'Economist'], '_id': ObjectId('5eb590678e17271bc85a33d8')}

In [15]:

# Matching array elements
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1, "Profession": 1}).sort("Name", 1):
    pprint(doc)

Output of In [15]:

{'Name': 'Euler', 'Profession': 'Mathematician'}
{'Name': 'Gauss', 'Profession': 'Mathematician'}
{'Name': 'Gödel', 'Profession': 'Mathematician'}
{'Name': 'Lagrange', 'Profession': 'Mathematician'}
{'Name': 'Nash', 'Profession': ['Mathematician', 'Economist']}
{'Name': 'Pythagoras', 'Profession': 'Mathematician'}
{'Name': 'Ramanujan', 'Profession': 'Mathematician'}
{'Name': 'Thales', 'Profession': 'Mathematician'}

In [16]:

# Delete documents
scientists.delete_one({"Profession": "Astrophysicist"})
scientists.count_documents({"Name": "Galileo"})

pymongo vs MongoDB shell

In the lecture, we learnt how to write queries in the syntax of the MongoDB shell. The syntax is a bit different from that of pymongo. Here are a few examples:

| | MongoDB shell | pymongo | Note |
|---|---|---|---|
| Insert | insert() | insert_one() or insert_many() | insert() is also valid for pymongo but deprecated. |
| Update | update() | update_one() or update_many() | update() is also valid for pymongo but deprecated. |
| Delete | remove() | delete_one() or delete_many() | remove() is also valid for pymongo but deprecated. |
| Sort criterion | JSON document | list of (key, direction) pairs | |
| Naming convention | camelCase (e.g. createIndex) | snake_case (e.g. create_index) | |
| Count | db.collection.find(filter).count() | db.collection.count_documents(filter) | count() is also valid for pymongo but deprecated. |

It is not necessary to remember these differences, but you should understand the semantics of a query written in either pymongo or MongoDB shell syntax.
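The "sort criterion" difference from the comparison above can be made concrete with a short sketch. The sort keys are chosen for illustration, and the `sorted_scientists` helper is hypothetical; it assumes it is given a pymongo Collection such as db.scientists.

```python
# Shell-style sort document, as in: db.scientists.find().sort({Profession: 1, Name: -1})
shell_sort = {"Profession": 1, "Name": -1}

# pymongo expects a list of (key, direction) pairs instead.
# Python dicts preserve insertion order (3.7+), so the key order survives.
pymongo_sort = list(shell_sort.items())
print(pymongo_sort)


def sorted_scientists(collection):
    """`collection` is assumed to be a pymongo Collection such as db.scientists."""
    return list(collection.find({}, {"_id": 0}).sort(pymongo_sort))
```

The order of the pairs matters: documents are ordered by the first key, and the second key only breaks ties.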

Out[16]:

0

2.5 A larger dataset

Now it's time to play with a dataset of more realistic size! Try inserting a document into the restaurants collection. The example also shows the structure of the documents in the collection.

In [17]:

from dateutil.parser import isoparse

db.restaurants.insert_one({
    "address": {
        "street": "2 Avenue",
        "zipcode": "10075",
        "building": "1480",
        "coord": [-73.9557413, 40.7720266]
    },
    "borough": "Manhattan",
    "cuisine": "Italian",
    "grades": [
        {"date": isoparse("2014-10-01T00:00:00Z"), "grade": "A", "score": 11},
        {"date": isoparse("2014-01-16T00:00:00Z"), "grade": "A", "score": 17}
    ],
    "name": "Vella",
    "restaurant_id": "41704620"
})

Out[17]:

<pymongo.results.InsertOneResult at 0x7fb81605d648>

In [18]:

# Query one document in a collection:
pprint(db.restaurants.find_one())

Output of In [18]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9ea8'),
 'address': {'building': '1007', 'coord': [-73.856077, 40.848447], 'street': 'Morris Park Ave', 'zipcode': '10462'},
 'borough': 'Bronx',
 'cuisine': 'Bakery',
 'grades': [{'date': datetime.datetime(2014, 3, 3, 0, 0), 'grade': 'A', 'score': 2},
            {'date': datetime.datetime(2013, 9, 11, 0, 0), 'grade': 'A', 'score': 6},
            {'date': datetime.datetime(2013, 1, 24, 0, 0), 'grade': 'A', 'score': 10},
            {'date': datetime.datetime(2011, 11, 23, 0, 0), 'grade': 'A', 'score': 9},
            {'date': datetime.datetime(2011, 3, 10, 0, 0), 'grade': 'B', 'score': 14}],
 'name': 'Morris Park Bake Shop',
 'restaurant_id': '30075445'}

2.6 Questions

For this part of the exercise, we will use the restaurants collection. Write queries in MongoDB that return the following:

1) All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".

In [19]:

# insert your query here:
cursor = db.restaurants.find({"borough": "Brooklyn", "cuisine": "Hamburgers"})
pprint(cursor[0])

Output of In [19]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9ea9'),
 'address': {'building': '469', 'coord': [-73.961704, 40.662942], 'street': 'Flatbush Avenue', 'zipcode': '11225'},
 'borough': 'Brooklyn',
 'cuisine': 'Hamburgers',
 'grades': [{'date': datetime.datetime(2014, 12, 30, 0, 0), 'grade': 'A', 'score': 8},
            {'date': datetime.datetime(2014, 7, 1, 0, 0), 'grade': 'B', 'score': 23},
            {'date': datetime.datetime(2013, 4, 30, 0, 0), 'grade': 'A', 'score': 12},
            {'date': datetime.datetime(2012, 5, 8, 0, 0), 'grade': 'A', 'score': 12}],
 'name': "Wendy'S",
 'restaurant_id': '30112340'}

2) The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".

In [20]:

# insert your query here:
db.restaurants.count_documents({"borough": "Brooklyn", "cuisine": "Hamburgers"})

Out[20]:

102

3) All restaurants with zipcode 11225.

In [21]:

# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225"})
pprint(cursor[0])

4) Names of restaurants with zipcode 11225 that have at least one grade "C".

In [22]:

# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225", "grades.grade": "C"}, {"name": 1})
pprint(cursor[0])

5) Names of restaurants with zipcode 11225 whose first grade is "C" and whose second grade is "A".

In [23]:

# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225", "grades.0.grade": "C", "grades.1.grade": "A"}, {"name": 1})
pprint(cursor[0])

6) Names and streets of restaurants that don't have an "A" grade.

In [24]:

# insert your query here:
cursor = db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1, "address.street": 1})
pprint(cursor[0])

Output of In [21]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9ea9'),
 'address': {'building': '469', 'coord': [-73.961704, 40.662942], 'street': 'Flatbush Avenue', 'zipcode': '11225'},
 'borough': 'Brooklyn',
 'cuisine': 'Hamburgers',
 'grades': [{'date': datetime.datetime(2014, 12, 30, 0, 0), 'grade': 'A', 'score': 8},
            {'date': datetime.datetime(2014, 7, 1, 0, 0), 'grade': 'B', 'score': 23},
            {'date': datetime.datetime(2013, 4, 30, 0, 0), 'grade': 'A', 'score': 12},
            {'date': datetime.datetime(2012, 5, 8, 0, 0), 'grade': 'A', 'score': 12}],
 'name': "Wendy'S",
 'restaurant_id': '30112340'}

Output of In [22]:

{'_id': ObjectId('5eb5905562bbc67f6c4ea50c'), 'name': "Vee'S Restaurant"}

Output of In [23]:

{'_id': ObjectId('5eb5905762bbc67f6c4ef036'), 'name': 'Careta Bar & Restaurant'}

Output of In [24]:

{'_id': ObjectId('5eb5905562bbc67f6c4ea05d'), 'address': {'street': 'Thompson Street'}, 'name': 'Tomoe Sushi'}

7) All restaurants with a grade C and a score greater than 50 for that grade at the same time.

In [25]:

# insert your query here:
cursor = db.restaurants.find({"grades": {"$elemMatch": {"grade": "C", "score": {"$gt": 50}}}})
pprint(cursor[0])
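Question 7 needs $elemMatch because both conditions must hold for the same array element. The sketch below emulates the two query forms in plain Python to make the difference visible; the two helper functions and the sample document are hypothetical and only mimic the matching rules, they do not call MongoDB.

```python
def matches_elem_match(doc):
    """Emulates {"grades": {"$elemMatch": {"grade": "C", "score": {"$gt": 50}}}}:
    one single array element must satisfy both conditions."""
    return any(g["grade"] == "C" and g["score"] > 50 for g in doc["grades"])


def matches_dotted_and(doc):
    """Emulates {"grades.grade": "C", "grades.score": {"$gt": 50}}:
    each condition may be satisfied by a different array element."""
    grades = doc["grades"]
    return any(g["grade"] == "C" for g in grades) and any(g["score"] > 50 for g in grades)


# Hypothetical restaurant: the C grade and the high score are separate entries.
mixed = {
    "name": "Hypothetical Diner",
    "grades": [{"grade": "C", "score": 10}, {"grade": "B", "score": 60}],
}

print(matches_elem_match(mixed))
print(matches_dotted_and(mixed))
```

The sample restaurant is rejected by the $elemMatch form but accepted by the dot-notation form, which is why the dot-notation query would return too many restaurants for question 7.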

8) All restaurants with a grade C or a score greater than 50.

In [26]:

# insert your query here:
cursor = db.restaurants.find({"$or": [{"grades.score": {"$gt": 50}}, {"grades.grade": "C"}]})
pprint(cursor[0])

9) All restaurants that have only A grades.

In [27]:

# insert your query here:
cursor = db.restaurants.find({"grades": {"$not": {"$elemMatch": {"grade": {"$ne": "A"}}}}})
pprint(cursor[0])
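Questions 6 and 9 look similar but select opposite ends of the collection. The plain-Python emulation below is a sketch of the matching rules as they apply to array fields; the helper names and the four sample documents are hypothetical, and every grade entry is assumed to carry a "grade" field.

```python
def no_a_grade(doc):
    """Emulates {"grades.grade": {"$ne": "A"}}: no element has grade "A"."""
    return all(g["grade"] != "A" for g in doc["grades"])


def only_a_grades(doc):
    """Emulates {"grades": {"$not": {"$elemMatch": {"grade": {"$ne": "A"}}}}}:
    no element has a grade other than "A"."""
    return not any(g["grade"] != "A" for g in doc["grades"])


samples = {
    "all_a": {"grades": [{"grade": "A"}, {"grade": "A"}]},
    "mixed": {"grades": [{"grade": "A"}, {"grade": "B"}]},
    "no_a": {"grades": [{"grade": "B"}, {"grade": "C"}]},
    "ungraded": {"grades": []},
}

for name, doc in samples.items():
    print(name, no_a_grade(doc), only_a_grades(doc))
```

Note that a restaurant with an empty grades array satisfies both conditions, so it is returned by both queries. If ungraded restaurants should be excluded, the filter needs an additional condition on the array.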

Output of In [25]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9eb4'),
 'address': {'building': '1269', 'coord': [-73.871194, 40.6730975], 'street': 'Sutter Avenue', 'zipcode': '11208'},
 'borough': 'Brooklyn',
 'cuisine': 'Chinese',
 'grades': [{'date': datetime.datetime(2014, 9, 16, 0, 0), 'grade': 'B', 'score': 21},
            {'date': datetime.datetime(2013, 8, 28, 0, 0), 'grade': 'A', 'score': 7},
            {'date': datetime.datetime(2013, 4, 2, 0, 0), 'grade': 'C', 'score': 56},
            {'date': datetime.datetime(2012, 8, 15, 0, 0), 'grade': 'B', 'score': 27},
            {'date': datetime.datetime(2012, 3, 28, 0, 0), 'grade': 'B', 'score': 27}],
 'name': 'May May Kitchen',
 'restaurant_id': '40358429'}

Output of In [26]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9eb4'),
 'address': {'building': '1269', 'coord': [-73.871194, 40.6730975], 'street': 'Sutter Avenue', 'zipcode': '11208'},
 'borough': 'Brooklyn',
 'cuisine': 'Chinese',
 'grades': [{'date': datetime.datetime(2014, 9, 16, 0, 0), 'grade': 'B', 'score': 21},
            {'date': datetime.datetime(2013, 8, 28, 0, 0), 'grade': 'A', 'score': 7},
            {'date': datetime.datetime(2013, 4, 2, 0, 0), 'grade': 'C', 'score': 56},
            {'date': datetime.datetime(2012, 8, 15, 0, 0), 'grade': 'B', 'score': 27},
            {'date': datetime.datetime(2012, 3, 28, 0, 0), 'grade': 'B', 'score': 27}],
 'name': 'May May Kitchen',
 'restaurant_id': '40358429'}

Output of In [27]:

{'_id': ObjectId('5eb5905562bbc67f6c4e9eaa'),
 'address': {'building': '351', 'coord': [-73.98513559999999, 40.7676919], 'street': 'West 57 Street', 'zipcode': '10019'},
 'borough': 'Manhattan',
 'cuisine': 'Irish',
 'grades': [{'date': datetime.datetime(2014, 9, 6, 0, 0), 'grade': 'A', 'score': 2},
            {'date': datetime.datetime(2013, 7, 22, 0, 0), 'grade': 'A', 'score': 11},
            {'date': datetime.datetime(2012, 7, 31, 0, 0), 'grade': 'A', 'score': 12},
            {'date': datetime.datetime(2011, 12, 29, 0, 0), 'grade': 'A', 'score': 12}],
 'name': 'Dj Reynolds Pub And Restaurant',
 'restaurant_id': '30191841'}

3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. Such a collection scan can be highly inefficient and requires MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

MongoDB supports indexes that contain either a single field or multiple fields, depending on the operations that the index type supports.

By default, MongoDB creates the _id index, an ascending unique index on the _id field, for every collection when the collection is created. You cannot remove the index on the _id field.

Managing indexes in MongoDB

The explain() method provides information on the query plan. It returns a document that describes the process and indexes used to answer the query. This can provide useful insight when attempting to optimize a query. Example:

In [28]:

db.restaurants.find({"borough" : "Brooklyn"}).explain()

Out[28]:

{'executionStats': {'allPlansExecution': [], 'executionStages': {'advanced': 6086, 'direction': 'forward', 'docsExamined': 25360, 'executionTimeMillisEstimate': 0, 'filter': {'borough': {'$eq': 'Brooklyn'}}, 'isEOF': 1, 'nReturned': 6086, 'needTime': 19275, 'needYield': 0, 'restoreState': 198, 'saveState': 198, 'stage': 'COLLSCAN', 'works': 25362}, 'executionSuccess': True, 'executionTimeMillis': 12, 'nReturned': 6086, 'totalDocsExamined': 25360, 'totalKeysExamined': 0}, 'ok': 1.0, 'queryPlanner': {'indexFilterSet': False, 'namespace': 'test.restaurants', 'parsedQuery': {'borough': {'$eq': 'Brooklyn'}}, 'plannerVersion': 1, 'rejectedPlans': [], 'winningPlan': {'direction': 'forward', 'filter': {'borough': {'$eq': 'Brooklyn'}}, 'stage': 'COLLSCAN'}},

In pymongo , you can create an index by calling the create_index() method. For example, we can create an index for the borough field:

In [29]:

db.restaurants.create_index("borough")

Now, let's see how the query plan changes to use the newly created index:

In [30]:

db.restaurants.find({"borough" : "Brooklyn"}).explain()

Out[28] (continued):

 'serverInfo': {'gitVersion': '20364840b8f1af16917e4c23c1b5f5efd8b352f8', 'host': 'wk-caas-34de4dceae35483a9428b1463bda6c33-701ce6b11bdb2d440aceb0', 'port': 27017, 'version': '4.2.6'}}

Out[29]:

'borough_1'

Out[30]:

{'executionStats': {'allPlansExecution': [], 'executionStages': {'advanced': 6086, 'alreadyHasObj': 0, 'docsExamined': 6086, 'executionTimeMillisEstimate': 1, 'inputStage': {'advanced': 6086, 'direction': 'forward', 'dupsDropped': 0, 'dupsTested': 0, 'executionTimeMillisEstimate': 1, 'indexBounds': {'borough': ['["Brooklyn", "Brooklyn"]']}, 'indexName': 'borough_1', 'indexVersion': 2, 'isEOF': 1, 'isMultiKey': False, 'isPartial': False, 'isSparse': False, 'isUnique': False, 'keyPattern': {'borough': 1}, 'keysExamined': 6086, 'multiKeyPaths': {'borough': []}, 'nReturned': 6086, 'needTime': 0, 'needYield': 0, 'restoreState': 47, 'saveState': 47, 'seeks': 1, 'stage': 'IXSCAN', 'works': 6087}, 'isEOF': 1, 'nReturned': 6086, 'needTime': 0, 'needYield': 0, 'restoreState': 47, 'saveState': 47, 'stage': 'FETCH', 'works': 6087}, 'executionSuccess': True, 'executionTimeMillis': 10, 'nReturned': 6086, 'totalDocsExamined': 6086, 'totalKeysExamined': 6086}, 'ok': 1.0, 'queryPlanner': {'indexFilterSet': False, 'namespace': 'test.restaurants', 'parsedQuery': {'borough': {'$eq': 'Brooklyn'}}, 'plannerVersion': 1, 'rejectedPlans': [], 'winningPlan': {'inputStage': {'direction': 'forward',

The number of documents examined is indicated in the docsExamined field. The number drops significantly when an index is used. In fact, in this example the number of documents examined is exactly the number of documents returned (nReturned).
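When comparing explain() outputs, it often suffices to look at the chain of stages and at docsExamined. The helper below is a minimal sketch that walks the nested inputStage entries of an executionStages document; the name `plan_stages` is hypothetical, the two sample plans are abridged from the outputs of In [28] and In [30], and plans with several child stages (an inputStages list) are not handled.

```python
def plan_stages(stage):
    """Return stage names from the root of an explain() plan down to its leaf."""
    names = []
    while stage is not None:
        names.append(stage["stage"])
        stage = stage.get("inputStage")  # None once the leaf stage is reached
    return names


# Abridged from the two explain() outputs of this section.
before_index = {"stage": "COLLSCAN", "docsExamined": 25360, "nReturned": 6086}
after_index = {
    "stage": "FETCH",
    "docsExamined": 6086,
    "nReturned": 6086,
    "inputStage": {"stage": "IXSCAN", "indexName": "borough_1", "keysExamined": 6086},
}

for plan in (before_index, after_index):
    print(plan_stages(plan), plan["docsExamined"], plan["nReturned"])
```

With a live connection, the same helper could be applied to `db.restaurants.find(...).explain()['executionStats']['executionStages']`.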

The index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order. Note that index direction only matters in a compound index: a single-field index can be traversed in either direction, so it serves ascending and descending sorts equally well.
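Why direction matters only for compound indexes can be illustrated with a simplified rule: an index can deliver a sort order if the sort keys form a prefix of the index keys and the directions are either all identical or all inverted (the index is then read backwards). The function below is a sketch of that rule only; its name is hypothetical and it does not model the full behaviour of MongoDB's query planner.

```python
def index_supports_sort(index_keys, sort_keys):
    """Simplified rule: the sort must be a prefix of the index keys with
    every direction identical or every direction inverted."""
    if not sort_keys or len(sort_keys) > len(index_keys):
        return False
    prefix = index_keys[: len(sort_keys)]
    same = all(i == s for i, s in zip(prefix, sort_keys))
    inverted = all(
        i_field == s_field and i_dir == -s_dir
        for (i_field, i_dir), (s_field, s_dir) in zip(prefix, sort_keys)
    )
    return same or inverted


compound = [("cuisine", -1), ("borough", 1)]
print(index_supports_sort(compound, [("cuisine", -1), ("borough", 1)]))  # same directions
print(index_supports_sort(compound, [("cuisine", 1), ("borough", -1)]))  # fully inverted
print(index_supports_sort(compound, [("cuisine", 1), ("borough", 1)]))   # mixed
print(index_supports_sort([("borough", 1)], [("borough", -1)]))          # single field
```

Under this rule the first two sorts and the single-field sort can use the index, while the mixed-direction sort cannot, which is the case where the directions chosen at index creation make a difference.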

To remove all indexes, you can use db.collection.drop_indexes() . Example:

In [31]:

print("Before drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)

print("Now we drop all indexes...")
db.restaurants.drop_indexes()

print("After drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)

To remove a specific index you can use db.collection.drop_index(index_name) . Example:

In [32]:

print('Create some indexes first...')
db.restaurants.create_index([('cuisine', -1), ('borough', 1)])
index_name = db.restaurants.create_index('address.building')

print('\nNow we have these indexes:')
for index in db.restaurants.list_indexes():
    pprint(index)

print('\nThen drop_index()...')
db.restaurants.drop_index(index_name)

print('\nThe remaining indexes are:')
for index in db.restaurants.list_indexes():
    pprint(index)

Out[30] (continued):

 'indexBounds': {'borough': ['["Brooklyn", "Brooklyn"]']}, 'indexName': 'borough_1', 'indexVersion': 2, 'isMultiKey': False, 'isPartial': False, 'isSparse': False, 'isUnique': False, 'keyPattern': {'borough': 1}, 'multiKeyPaths': {'borough': []}, 'stage': 'IXSCAN'}, 'stage': 'FETCH'}}, 'serverInfo': {'gitVersion': '20364840b8f1af16917e4c23c1b5f5efd8b352f8', 'host': 'wk-caas-34de4dceae35483a9428b1463bda6c33-701ce6b11bdb2d440aceb0', 'port': 27017, 'version': '4.2.6'}}

Output of In [31]:

Before drop_indexes():
{'key': SON([('_id', 1)]), 'name': '_id_', 'ns': 'test.restaurants', 'v': 2}
{'key': SON([('borough', 1)]), 'name': 'borough_1', 'ns': 'test.restaurants', 'v': 2}
Now we drop all indexes...
After drop_indexes():
{'key': SON([('_id', 1)]), 'name': '_id_', 'ns': 'test.restaurants', 'v': 2}

3.1 Questions

Please answer questions 1) and 2) in Moodle.

1) Which queries will use the following index:

db.restaurants.create_index("borough")

A. db.restaurants.find({"address.city": "Boston"})
B. db.restaurants.find({}, {"borough": 1})
C. db.restaurants.find().sort([("borough", 1)])
D. db.restaurants.find({"cuisine": "Italian"}, {"borough": 1})

Solution: Only query C would benefit from the index.

In [33]:

print('Creating index on "borough"...')
db.restaurants.drop_indexes()
db.restaurants.create_index("borough")

print('\nQuery A:')
print(db.restaurants.find({"address.city": "Boston"}).explain()['executionStats']['executionStages'])

print('\nQuery B:')
print(db.restaurants.find({}, {"borough": 1}).explain()['executionStats']['executionStages'])

print('\nQuery C:')
print(db.restaurants.find().sort([("borough", 1)]).explain()['executionStats']['executionStages'])

print('\nQuery D:')
print(db.restaurants.find({"cuisine": "Italian"}, {"borough": 1}).explain()['executionStats']['executionStages'])

Output of In [32]:

Create some indexes first...

Now we have these indexes:
{'key': SON([('_id', 1)]), 'name': '_id_', 'ns': 'test.restaurants', 'v': 2}
{'key': SON([('cuisine', -1), ('borough', 1)]), 'name': 'cuisine_-1_borough_1', 'ns': 'test.restaurants', 'v': 2}
{'key': SON([('address.building', 1)]), 'name': 'address.building_1', 'ns': 'test.restaurants', 'v': 2}

Then drop_index()...

The remaining indexes are:
{'key': SON([('_id', 1)]), 'name': '_id_', 'ns': 'test.restaurants', 'v': 2}
{'key': SON([('cuisine', -1), ('borough', 1)]), 'name': 'cuisine_-1_borough_1', 'ns': 'test.restaurants', 'v': 2}

Output of In [33]:

Creating index on "borough"...

Query A:
{'advanced': 0, 'stage': 'COLLSCAN', 'nReturned': 0, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 25361, 'works': 25362, 'filter': {'address.city': {'$eq': 'Boston'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 1, 'restoreState': 198}

Query B:
{'advanced': 25360, 'transformBy': {'borough': 1}, 'stage': 'PROJECTION_SIMPLE', 'nReturned': 25360, 'isEOF': 1, 'saveState': 198, 'inputStage': {'advanced': 25360, 'stage': 'COLLSCAN',

2) Which queries will use the following index:

db.restaurants.create_index([("address", -1)])

A. db.restaurants.find({"address.zipcode": "11225"})
B. db.restaurants.find({"address.city": "Boston"})
C. db.restaurants.find({"address.city": "Boston"}, {"address": 1})
D. db.restaurants.find({"address": 1})

Solution: Only query D would benefit from the index.

In [34]:

print('Creating index on "address"...')
db.restaurants.drop_indexes()
db.restaurants.create_index("address")

print('\nQuery A:')
print(db.restaurants.find({"address.zipcode": "11225"}).explain()['executionStats']['executionStages'])

print('\nQuery B:')
print(db.restaurants.find({"address.city": "Boston"}).explain()['executionStats']['executionStages'])

print('\nQuery C:')
print(db.restaurants.find({"address.city": "Boston"}, {"address": 1}).explain()['executionStats']['executionStages'])

print('\nQuery D:')
print(db.restaurants.find({"address": 1}).explain()['executionStats']['executionStages'])

Output of In [33], Query B (continued):

 'nReturned': 25360, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 1, 'works': 25362, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 0, 'restoreState': 198}, 'needTime': 1, 'works': 25362, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'restoreState': 198}

Query C:
{'advanced': 25360, 'inputStage': {'dupsTested': 0, 'isPartial': False, 'dupsDropped': 0, 'stage': 'IXSCAN', 'nReturned': 25360, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 0, 'isUnique': False, 'multiKeyPaths': {'borough': []}, 'keysExamined': 25360, 'indexBounds': {'borough': ['[MinKey, MaxKey]']}, 'indexName': 'borough_1', 'indexVersion': 2, 'isMultiKey': False, 'advanced': 25360, 'isSparse': False, 'works': 25361, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'seeks': 1, 'restoreState': 198, 'keyPattern': {'borough': 1}}, 'stage': 'FETCH', 'nReturned': 25360, 'isEOF': 1, 'saveState': 198, 'needTime': 0, 'works': 25361, 'needYield': 0, 'docsExamined': 25360, 'alreadyHasObj': 0, 'executionTimeMillisEstimate': 4, 'restoreState': 198}

Query D:
{'advanced': 1070, 'transformBy': {'borough': 1}, 'stage': 'PROJECTION_SIMPLE', 'nReturned': 1070, 'isEOF': 1, 'saveState': 198, 'inputStage': {'advanced': 1070, 'stage': 'COLLSCAN', 'nReturned': 1070, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 24291, 'works': 25362, 'filter': {'cuisine': {'$eq': 'Italian'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 1, 'restoreState': 198}, 'needTime': 24291, 'works': 25362, 'needYield': 0, 'executionTimeMillisEstimate': 1, 'restoreState': 198}

Creating index on "address"...

Query A:
{'advanced': 112, 'stage': 'COLLSCAN', 'nReturned': 112, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 25249, 'works': 25362, 'filter': {'address.zipcode': {'$eq': '11225'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 1, 'restoreState': 198}

Query B:
{'advanced': 0, 'stage': 'COLLSCAN', 'nReturned': 0, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 25361, 'works': 25362, 'filter': {'address.city': {'$eq': 'Boston'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 0, 'restoreState': 198}

Query C:{'advanced': 0, 'transformBy': {'address': 1}, 'stage': 'PROJECTION_SIMPLE', 'nReturned': 0, 'isEOF': 1, 'saveState': 198, 'inputStage': {'advanced': 0, 'stage': 'COLLSCAN', 'nReturned': 0, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 25361, 'works': 25362, 'filter':

3) Write a command for creating an index on the zipcode field.

Solution:

In [35]:

db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("address.zipcode", 1)])

# print all indexes
for index in db.restaurants.list_indexes():
    pprint(index)
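Each index reported by list_indexes() carries a 'name'. When no name is given at creation time, MongoDB derives one by joining every indexed field with its direction using underscores. The sketch below reproduces that convention in plain Python; default_index_name is a hypothetical helper written for illustration, not the driver's own code.

```python
def default_index_name(keys):
    """Build the default index name from a list of (field, direction) pairs."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)


print(default_index_name([("address.zipcode", 1)]))
# address.zipcode_1
print(default_index_name([("grades.score", 1), ("grades.grade", 1)]))
# grades.score_1_grades.grade_1
```

Knowing the name makes it easier to recognise the index in the 'indexName' field of an IXSCAN stage in explain() output.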

4) Write an index to speed up the following query:

db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1 , "address.street": 1})

Solution:

In [36]:

db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("grades.grade", 1)])

# verify the query plan
print(db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1, "address.street": 1}).explain()['executionStats']['executionStages'])

'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 25361, 'works': 25362, 'filter': {'address.city': {'$eq': 'Boston'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 0, 'restoreState': 198}, 'needTime': 25361, 'works': 25362, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'restoreState': 198}

Query D:
{'advanced': 0, 'inputStage': {'dupsTested': 0, 'isPartial': False, 'dupsDropped': 0, 'stage': 'IXSCAN', 'nReturned': 0, 'isEOF': 1, 'saveState': 0, 'direction': 'forward', 'needTime': 0, 'isUnique': False, 'multiKeyPaths': {'address': []}, 'keysExamined': 0, 'indexBounds': {'address': ['[1, 1]']}, 'indexName': 'address_1', 'indexVersion': 2, 'isMultiKey': False, 'advanced': 0, 'isSparse': False, 'works': 1, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'seeks': 1, 'restoreState': 0, 'keyPattern': {'address': 1}}, 'stage': 'FETCH', 'nReturned': 0, 'isEOF': 1, 'saveState': 0, 'needTime': 0, 'works': 1, 'needYield': 0, 'docsExamined': 0, 'alreadyHasObj': 0, 'executionTimeMillisEstimate': 0, 'restoreState': 0}

{'key': SON([('_id', 1)]), 'name': '_id_', 'ns': 'test.restaurants', 'v': 2}
{'key': SON([('address.zipcode', 1)]), 'name': 'address.zipcode_1', 'ns': 'test.restaurants', 'v': 2}

{'advanced': 1919, 'transformBy': {'name': 1, 'address.street': 1}, 'stage': 'PROJECTION_DEFAULT', 'nReturned': 1919, 'isEOF': 1, 'saveState': 115, 'inputStage': {'advanced': 1919, 'inputStage': {'dupsTested': 14742, 'isPartial': False, 'dupsDropped': 3126, 'stage': 'IXSCAN', 'nReturned': 11616, 'isEOF': 1, 'saveState': 115, 'direction': 'forward', 'needTime': 3127, 'isUnique': False, 'multiKeyPaths': {'grades.grade': ['grades']}, 'keysExamined': 14743, 'indexBounds': {'grades.grade': ['[MinKey, "A")', '("A", MaxKey]']}, 'indexName': 'grades.grade_1', 'indexVersion': 2, 'isMultiKey': True, 'advanced': 11616, 'isSparse': False, 'works': 14744, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'seeks': 2, 'restoreState': 115, 'keyPattern': {'grades.grade': 1}}, 'stage': 'FETCH', 'nReturned': 1919, 'isEOF': 1, 'saveState': 115, 'needTime': 12824, 'works': 14744, 'filter': {'grades.grade': {'$not': {'$eq': 'A'}}}, 'needYield': 0, 'docsExamined': 11616, 'alreadyHasObj': 0, 'executionTimeMillisEstimate': 2, 'restoreState': 115}, 'needTime': 12824, 'works': 14744, 'needYield': 0, 'executionTimeMillisEstimate': 2, 'restoreState': 115}
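One way to judge how much the index helped is to compare the number of documents the plan had to examine with the number it returned. The sketch below assumes a plan shaped like the one above (a single chain of 'inputStage' entries); the helper names are hypothetical. The figures are copied from the output of this exercise: 11616 documents examined for 1919 results with the index, against 25360 documents examined by the collection scans.

```python
def find_stage_value(plan, key):
    """Return the first value stored under `key`, walking down 'inputStage' links."""
    while plan is not None:
        if key in plan:
            return plan[key]
        plan = plan.get("inputStage")
    return None


def docs_examined_per_result(plan):
    """Return docsExamined / nReturned, or None if the ratio is undefined."""
    returned = plan.get("nReturned")
    examined = find_stage_value(plan, "docsExamined")
    if not returned or examined is None:
        return None
    return examined / returned


# Abbreviated version of the plan printed for the "$ne" query.
with_index = {
    "stage": "PROJECTION_DEFAULT",
    "nReturned": 1919,
    "inputStage": {
        "stage": "FETCH",
        "docsExamined": 11616,
        "inputStage": {"stage": "IXSCAN", "keysExamined": 14743},
    },
}
# Hypothetical plan for the same query without an index: a scan of all 25360 documents.
without_index = {"stage": "COLLSCAN", "nReturned": 1919, "docsExamined": 25360}

print(round(docs_examined_per_result(with_index), 2))
print(round(docs_examined_per_result(without_index), 2))
```

The index roughly halves the work (about 6 documents examined per result instead of about 13), but a "$ne" condition is not very selective, so many documents still have to be fetched and filtered.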

5) Write an index to speed up the following query:

db.restaurants.find({"grades.score" : {"$gt" : 50}, "grades.grade" : "C"})

Solution:

In [37]:

db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("grades.score", 1), ("grades.grade", 1)])

# verify the query plan
print(db.restaurants.find({"grades.score": {"$gt": 50}, "grades.grade": "C"}).explain()['executionStats']['executionStages'])

Comment: The index would not work for the following query, because "grades.grade" is not a prefix of the index key ("grades.score", "grades.grade"):

db.restaurants.find({"grades.grade" : "C"})

See the query plan below:

In [38]:

# verify the query plan
print(db.restaurants.find({"grades.grade": "C"}).explain()['executionStats']['executionStages'])

**Important: please delete your container after finishing the exercise.**

{'advanced': 315, 'inputStage': {'dupsTested': 356, 'isPartial': False, 'dupsDropped': 7, 'stage': 'IXSCAN', 'nReturned': 349, 'isEOF': 1, 'saveState': 2, 'direction': 'forward', 'needTime': 7, 'isUnique': False, 'multiKeyPaths': {'grades.grade': ['grades'], 'grades.score': ['grades']}, 'keysExamined': 356, 'indexBounds': {'grades.grade': ['[MinKey, MaxKey]'], 'grades.score': ['(50, inf.0]']}, 'indexName': 'grades.score_1_grades.grade_1', 'indexVersion': 2, 'isMultiKey': True, 'advanced': 349, 'isSparse': False, 'works': 357, 'needYield': 0, 'executionTimeMillisEstimate': 0, 'seeks': 1, 'restoreState': 2, 'keyPattern': {'grades.grade': 1, 'grades.score': 1}}, 'stage': 'FETCH', 'nReturned': 315, 'isEOF': 1, 'saveState': 2, 'needTime': 41, 'works': 357, 'filter': {'grades.grade': {'$eq': 'C'}}, 'needYield': 0, 'docsExamined': 349, 'alreadyHasObj': 0, 'executionTimeMillisEstimate': 0, 'restoreState': 2}

{'advanced': 2708, 'stage': 'COLLSCAN', 'nReturned': 2708, 'isEOF': 1, 'saveState': 198, 'direction': 'forward', 'needTime': 22653, 'works': 25362, 'filter': {'grades.grade': {'$eq': 'C'}}, 'needYield': 0, 'docsExamined': 25360, 'executionTimeMillisEstimate': 2, 'restoreState': 198}
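The collection scan above follows from the prefix rule for compound indexes: an index can narrow a query only if the query constrains the leading field of the index. The following sketch expresses that rule in plain Python. The function usable_prefix is hypothetical and deliberately simplified: it looks only at top-level field names and ignores operators and multikey behaviour (in the plan for query 5 above, only 'grades.score' gets tight index bounds, while 'grades.grade' is checked by the FETCH filter).

```python
def usable_prefix(index_keys, query):
    """Return the leading index fields that the query constrains.

    `index_keys` is a list of (field, direction) pairs, as passed to create_index.
    `query` is a filter document; only its top-level field names are considered.
    An empty result means the index cannot narrow the scan for this query.
    """
    prefix = []
    for field, _direction in index_keys:
        if field not in query:
            break
        prefix.append(field)
    return prefix


index_keys = [("grades.score", 1), ("grades.grade", 1)]
query_5 = {"grades.score": {"$gt": 50}, "grades.grade": "C"}
query_from_comment = {"grades.grade": "C"}

print(usable_prefix(index_keys, query_5))             # ['grades.score', 'grades.grade']
print(usable_prefix(index_keys, query_from_comment))  # []
```

Reversing the key order to ("grades.grade", "grades.score") would give both queries a non-empty prefix, at the cost of no longer supporting queries on "grades.score" alone.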

