Performance Evaluation of Analytical Queries
on a Stand-alone and Sharded Document Store

Aarthi Raghvendra and Karen Davis
Electrical Engineering and Computing Systems
University of Cincinnati
Cincinnati, Ohio, USA
Research Goals
• how do analytical queries perform in MongoDB?
• are there any benchmarks?
• what capabilities and features of MongoDB can
we utilize?
Overview
• MongoDB: data model
• previous performance studies
– identifying a benchmark
• experimental setup
– logical models
– query execution
• results and conclusions
• topics for future work
Why MongoDB?
• build applications faster
– flexible data model
– no need for object-relational mapping
• handle highly diverse data types
• manage applications more efficiently at scale
[https://www.mongodb.com/compare/mongodb-mysql]
MongoDB Data Model
• embedded documents
(aka denormalized)
• referenced documents
(aka normalized)
[pingax.com]
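The two modeling styles can be illustrated with a small sketch. The documents below are hypothetical TPC-DS-flavored examples (field names follow the benchmark's conventions, but the values are made up):

```python
# Referenced (normalized): fact and dimension live in separate collections,
# linked by a key, as in a relational schema.
item_doc = {"i_item_sk": 101, "i_item_id": "AAA", "i_current_price": 1.25}
sale_ref = {"ss_ticket_number": 9001, "ss_item_sk": 101, "ss_quantity": 3}

# Embedded (denormalized): the dimension row is nested inside the fact
# document, so a query touches a single collection and needs no join.
sale_embedded = {
    "ss_ticket_number": 9001,
    "ss_quantity": 3,
    "item": {"i_item_id": "AAA", "i_current_price": 1.25},
}

# With embedding, dimension attributes are read directly off the fact document.
price = sale_embedded["item"]["i_current_price"]
print(price)  # 1.25
```

The trade-off: embedding duplicates dimension data across fact documents (inflating storage) but avoids join work at query time; referencing stays compact but requires the join to be simulated.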
Previous Performance Study:
SQL Server vs. MongoDB
• query execution time
comparison
• 3 tables
• at most 12,416 tuples
• MongoDB performed better
for most cases except an
aggregation query
• stand-alone DBMSs
[Parker, Poe, and Vrbsky, 2013]
Reasons for Big Data Benchmarks
• promote new technologies (algorithms, architectures,
processes)
• tune system configurations and plan improvements
[Han, Lu, and Xu, 2014]
Big Data Benchmarks: Challenges
• key challenges and requirements for synthetic data
generation
– volume: meaningful with respect to type of workload
• scalable or partially-scalable
– velocity: rate of data generation and updates
• semi-controllable or uncontrollable
– variety: types of data
• structured, semi-structured, unstructured
– veracity: preservation of characteristics when scaling
• partially or un-considered
[Han, Lu, and Xu, 2014]
Big Data Benchmarks: Comparison
• criteria for our study:
– scalable volume
– semi-controllable velocity
– structured variety
– partially considered veracity
• candidate benchmarks:
– BigBench: [Ghazal et al., 2013]
• structured part: four tables adapted from TPC-DS
• 10 queries specified in English
– BigdataBench: two tables [Wang et al., 2014]
– TPC-DS: 24 tables and 100 queries [Poess, Nambiar, and Walrath, 2007]
[Han, Lu, and Xu, 2014]
TPC-DS Query Selection
• 100 queries; 23 in the data mining class
• criteria for selection (3 or more of):
– join of 4 or more tables,
– aggregation functions such as sum() and avg(),
– group by and order by clause,
– conditional constructs such as case, and
– correlated subquery using from clause.
features/queries                     Query 7  Query 21  Query 46  Query 50
Number of tables                        5        4         6         5
Number of aggregation functions         4        2         2         5
Number of group by/order by clauses     1        1         1         1
Number of conditional constructs        0        3         0         5
Number of correlated subquery(s)        0        0         1         0
TPC-DS: Schema
query coverage
• 3 (of 7) fact tables
• 9 (of 17) dimension tables
[www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf]
Query 21

select *
from (select w_warehouse_name
,i_item_id
,sum(case when (cast(d_date as date) < cast('2002-05-29' as date))
then inv_quantity_on_hand
else 0 end) as inv_before
,sum(case when (cast(d_date as date) >= cast('2002-05-29' as date))
then inv_quantity_on_hand
else 0 end) as inv_after
from inventory
,warehouse
,item
,date_dim
where i_current_price between 0.99 and 1.49
and i_item_sk = inv_item_sk
and inv_warehouse_sk = w_warehouse_sk
and inv_date_sk = d_date_sk
and d_date between (cast('2002-05-29' as date) - 30 days)
and (cast('2002-05-29' as date) + 30 days)
group by w_warehouse_name, i_item_id) x
where (case when inv_before > 0
then inv_after / inv_before
else null
end) between 2.0/3.0 and 3.0/2.0
order by w_warehouse_name
,i_item_id

Query 7

select i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim
,item, promotion
where ss_sold_date_sk = d_date_sk and
ss_item_sk = i_item_sk and
ss_cdemo_sk = cd_demo_sk and
ss_promo_sk = p_promo_sk and
cd_gender = 'M' and
cd_marital_status = 'M' and
cd_education_status = '4 yr Degree' and
(p_channel_email = 'N' or p_channel_event = 'N')
and d_year = 2001
group by i_item_id
order by i_item_id
Data Set Sizes

Table                       Number of Records
                            1GB          5GB
Call_center                 6            14
Catalog_page                11,718       11,718
Catalog_returns             144,067      720,174
Catalog_sales               1,441,548    7,199,490
Customer                    100,000      277,000
Customer_address            50,000       138,000
Customer_demographics       1,920,800    1,920,800
Date_dim                    73,049       73,049
Household_demographics      7,200        7,200
Income_band                 20           20
Inventory                   11,745,000   49,329,000
Item                        18,000       54,000
Promotion                   300          388
Reason                      35           39
Ship_mode                   20           20
Store                       12           52
Store_returns               287,514      1,437,911
Store_sales                 2,880,404    14,400,052
Time_dim                    86,400       86,400
Warehouse                   5            7
Web_page                    60           122
Web_returns                 71,763       359,991
Web_sales                   719,384      3,599,503
Web_site                    30           34
In MongoDB format:
• 1GB is 9.94GB
• 5GB is 41.93GB
Experimental Platform
• Amazon Web Services Elastic Compute Cloud (EC2)
– 2.4 GHz Intel Xeon E5-2676 v3 (Haswell) processors
– RAM/number of shards chosen accommodates the working set
– Red Hat Enterprise Linux 7.1
• MongoDB 3.0.2
                     Stand-alone    9.94GB sharded   41.93GB sharded
                     environment    environment      environment
Number of machines   1              5                5
Instance Type        m4.4xlarge     t2.large         m4.xlarge
vCPUs                16             2                4
RAM                  64GB           8GB              16GB
Migrating Relational Data into MongoDB
• normalized: each table becomes a collection
• denormalized: embeds dimension collections in
fact collection
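The denormalized migration can be sketched as follows. This is a minimal illustration of the embedding idea, not the authors' exact loading algorithm; the function and table names are hypothetical:

```python
# Sketch: embed each referenced dimension row inside its fact document,
# producing a denormalized MongoDB-style collection.

def embed_dimensions(fact_rows, dim_tables):
    """dim_tables maps (fk_field, embed_name) -> {dim_key: dimension_row}."""
    docs = []
    for row in fact_rows:
        doc = dict(row)
        for (fk, name), lookup in dim_tables.items():
            key = doc.pop(fk, None)   # replace the foreign key ...
            if key in lookup:
                doc[name] = lookup[key]   # ... with the nested dimension row
        docs.append(doc)
    return docs

# Hypothetical miniature data standing in for TPC-DS tables.
store_sales = [{"ss_ticket_number": 1, "ss_item_sk": 7, "ss_quantity": 2}]
item = {7: {"i_item_id": "AAA", "i_current_price": 0.99}}

docs = embed_dimensions(store_sales, {("ss_item_sk", "item"): item})
print(docs[0]["item"]["i_item_id"])  # AAA
```

In the normalized scheme the same rows would instead be inserted unchanged, one collection per table.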
Translating Queries for the
Normalized Approach
• performs semi-join with fact and dimension tables
• embeds referenced dimensions, leveraging projections
for needed fields
• performs aggregation
• the result is an embedded collection
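The semi-join step above can be sketched in pure Python. This is an illustrative stand-in for our translation (the real implementation issues MongoDB queries; here plain lists stand in for collections, and the function name is hypothetical):

```python
# Sketch of simulating a semi-join on the client: first collect the dimension
# keys that satisfy the predicate, then keep only fact documents whose foreign
# key falls in that set (in MongoDB, the second step uses a $in filter).

def semi_join(fact_docs, fk, dim_docs, dim_key, pred):
    matching_keys = {d[dim_key] for d in dim_docs if pred(d)}
    return [f for f in fact_docs if f[fk] in matching_keys]

# Hypothetical miniature collections.
inventory = [{"inv_item_sk": 1, "inv_quantity_on_hand": 5},
             {"inv_item_sk": 2, "inv_quantity_on_hand": 8}]
items = [{"i_item_sk": 1, "i_current_price": 1.25},
         {"i_item_sk": 2, "i_current_price": 2.50}]

# Predicate mirrors Query 21's i_current_price between 0.99 and 1.49.
result = semi_join(inventory, "inv_item_sk", items, "i_item_sk",
                   lambda d: 0.99 <= d["i_current_price"] <= 1.49)
print(len(result))  # 1
```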
MongoDB Operators   SQL Operators
$project            select
$match              where or having
$limit              limit
$group              group by
$sort               order by
$sum                sum or count
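Applying the mapping above, a SQL fragment such as `select i_item_id, sum(ss_quantity) from store_sales where ss_quantity > 0 group by i_item_id order by i_item_id` becomes an aggregation pipeline. The sketch below builds the pipeline as plain dicts (stage syntax follows the MongoDB aggregation framework; the field names are illustrative):

```python
# Each pipeline stage corresponds to one SQL clause per the table above.
pipeline = [
    {"$match": {"ss_quantity": {"$gt": 0}}},                    # where
    {"$group": {"_id": "$i_item_id",
                "total": {"$sum": "$ss_quantity"}}},            # group by + sum
    {"$sort": {"_id": 1}},                                      # order by
    {"$project": {"i_item_id": "$_id", "total": 1, "_id": 0}},  # select
]

# With a live server (e.g. via pymongo) this would run as:
#   db.store_sales.aggregate(pipeline)
print([list(stage)[0] for stage in pipeline])
```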
Performance Study: Experiments
• data load times
– 47 minutes and 3.5 hours, respectively, for 9.9GB and 41.9GB
• query execution times
– both data sets, stand-alone and sharded, for normalized data
– both data sets, stand-alone, for denormalized data
– 4 queries for both
• best case:
‒ denormalized, standalone
• Query 50:
‒ includes a specific shard key
‒ does not broadcast to all shards
Conclusions
• We develop a data loading algorithm to migrate data
generated by the TPC-DS benchmark into MongoDB for
query analysis.
• Because MongoDB did not support joins at the time of this
study, we develop an algorithm to simulate join operations
and perform aggregation.
• We investigate two data set sizes, two data organization
schemes, data distribution, and analytical queries.
• The trends in execution times remain the same at the
larger scale we investigate.
Future Work
• consider other ways to denormalize and construct
workloads [Herrero, Abello, and Romero, 2016]
• use BigBench for unstructured and semi-structured data
• collect other metrics such as data read/transmitted
• investigate MongoDB 3.4
– $lookup added in 3.2
– real-time analytics
– graph computations
• compare to Teradata Aster SQL-MR
• compare to other NoSQL models
[Truică, Bucur and Boicea, 2015]
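The $lookup stage mentioned above (added in MongoDB 3.2) expresses a left outer join server-side, which would replace our client-side join simulation. A sketch of such a stage for joining store_sales to item, again built as a plain dict with illustrative field names:

```python
# $lookup pulls matching dimension rows into an embedded array on each
# fact document, i.e. a server-side left outer join.
lookup_stage = {
    "$lookup": {
        "from": "item",               # dimension collection to join
        "localField": "ss_item_sk",   # fact foreign key
        "foreignField": "i_item_sk",  # dimension key
        "as": "item",                 # name of the embedded result array
    }
}
print(sorted(lookup_stage["$lookup"]))
```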
Query 46

select c_last_name
,c_first_name
,ca_city
,bought_city
,ss_ticket_number
,amt,profit
from
(select ss_ticket_number
,ss_customer_sk
,ca_city bought_city
,sum(ss_coupon_amt) amt
,sum(ss_net_profit) profit
from store_sales,date_dim,store,household_demographics,customer_address
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_store_sk = store.s_store_sk
and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
and store_sales.ss_addr_sk = customer_address.ca_address_sk
and (household_demographics.hd_dep_count = 2 or
household_demographics.hd_vehicle_count= 3)
and date_dim.d_dow in (6,0)
and date_dim.d_year in (1998,1998+1,1998+2)
and store.s_city in ('Midway','Fairview','Fairview','Fairview','Fairview')
group by ss_ticket_number,ss_customer_sk,ss_addr_sk,ca_city)
dn,customer,customer_address current_addr
where ss_customer_sk = c_customer_sk
and customer.c_current_addr_sk = current_addr.ca_address_sk
and current_addr.ca_city <> bought_city
order by c_last_name
,c_first_name
,ca_city
,bought_city
,ss_ticket_number
Query 50

select
s_store_name
,s_company_id
,s_street_number
,s_street_name
,s_street_type
,s_suite_number
,s_city
,s_county
,s_state
,s_zip
,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30) then 1 else 0 end) as "30 days"
,sum(case when (sr_returned_date_sk - ss_sold_date_sk > 30) and
(sr_returned_date_sk - ss_sold_date_sk <= 60) then 1 else 0 end) as "31-60 days"
,sum(case when (sr_returned_date_sk - ss_sold_date_sk > 60) and
(sr_returned_date_sk - ss_sold_date_sk <= 90) then 1 else 0 end) as "61-90 days"
,sum(case when (sr_returned_date_sk - ss_sold_date_sk > 90) and
(sr_returned_date_sk - ss_sold_date_sk <= 120) then 1 else 0 end) as "91-120 days"
,sum(case when (sr_returned_date_sk - ss_sold_date_sk > 120) then 1 else 0 end) as ">120 days"
from
store_sales
,store_returns
,store
,date_dim d1
,date_dim d2
where
d2.d_year = 1998
and d2.d_moy = 10
and ss_ticket_number = sr_ticket_number
and ss_item_sk = sr_item_sk
and ss_sold_date_sk = d1.d_date_sk
and sr_returned_date_sk = d2.d_date_sk
and ss_customer_sk = sr_customer_sk
and ss_store_sk = s_store_sk
group by
s_store_name
,s_company_id
,s_street_number
,s_street_name
,s_street_type
,s_suite_number
,s_city
,s_county
,s_state
,s_zip
order by s_store_name
,s_company_id
,s_street_number
,s_street_name
,s_street_type
,s_suite_number
,s_city