Appboy Analytics
Jon Hyman
NY MongoDB User Group, November 19, 2013
eBay NYC
@appboy @jon_hyman
A LITTLE BIT ABOUT US & APPBOY
Jon Hyman, CIO :: @jon_hyman
Appboy is a mobile relationship management platform for apps
(who we are and what we do)
Harvard Bridgewater
Appboy improves engagement by helping you understand your app users
• IDENTIFY - Understand demographics, social and behavioral data
• SEGMENT - Organize customers into groups based on behaviors, events, user attributes, and location
• ENGAGE - Message users through push notifications, emails, and multiple forms of in-app messages
Use Case: Customer engagement begins with onboarding
Urban Outfitters textPlus Shape Magazine
Agenda
• How to quickly store time series data in MongoDB using flexible schemas
• Learn how flexible schemas can easily provide breakdowns across dimensions
• Counting quickly: statistical analysis on top of MongoDB queries
What kinds of analytics does Appboy track?
• Lots of time series data
• App opens over time
• Events over time
• Revenue over time
• Marketing campaign stats and efficacy over time
What kinds of analytics does Appboy track?
• Breakdowns*
• Device types
• Device OS versions
• Screen resolutions
• Revenue by product
* We also care about this over time!
What kinds of analytics does Appboy track?
• User segment membership
• How many users are in each segment?
• How many can be emailed or reached via push notifications?
• What is the average revenue per user in the segment?
• Per paying user?
Pre-aggregated Analytics:
APP OPENS OVER TIME
Typical time series collection
Log a new row for each open received

{
  timestamp: 2013-11-14 00:00:00 UTC,
  app_id: App identifier
}

db.app_opens.find({app_id: A, timestamp: {$gte: date}})
Con: You need to aggregate the data before drawing the chart; lots of documents read into memory, lots of dirty pages
Pro: Really, really simple. Easy to add attribution to users.
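The "aggregate before drawing the chart" con can be made concrete. A minimal sketch (the helper name is ours, not from the talk) of the client-side bucketing you are forced to do once the raw open documents are read back:

```javascript
// With the raw-log schema, every document for the date range comes back
// to the app server and must be bucketed before charting, e.g. per hour.
function bucketByHour(openDocs) {
  const counts = {};                       // hour of day (0-23) -> open count
  for (const doc of openDocs) {
    const h = doc.timestamp.getUTCHours();
    counts[h] = (counts[h] || 0) + 1;
  }
  return counts;
}
```

For a busy app this means reading millions of documents into memory for a single chart, which is exactly what the pre-aggregated iterations below avoid.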
Fewer documents with pre-aggregation iteration 1
Create a document that groups by the time period

{
  app_id: App identifier,
  date: Date of the document,
  hour: 0-23 based hour this document represents,
  opens: Number of opens this hour
}

db.app_opens.update({date: D, app_id: A, hour: 0}, {$inc: {opens: 1}})
Con: We never care about an hour by itself. We lose attribution.
Pro: Really easy to draw histograms
Fewer documents with pre-aggregation iteration 2

Create a document by day and have each hour be a field

{
  app_id: App identifier,
  date: Date of the document,
  total_opens: Total number of opens this day,
  0: Number of opens at midnight,
  1: Number of opens at 1am,
  ...
  23: Number of opens at 11pm
}

db.app_opens.update(
  {date: D, app_id: A},
  {$inc: {"0": 1, total_opens: 1}}
)
Pro: Document count is low, easy to use aggregation framework for longer spans, fast: document should be in working set
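The "easy to use aggregation framework for longer spans" point can be sketched. The collection and field names follow the iteration-2 schema above; the pipeline itself is our assumption of how you would sum a month of daily totals:

```javascript
// Sketch (not from the talk): build an aggregation pipeline that sums
// the per-day total_opens fields over a date range. In the shell you
// would run db.app_opens.aggregate(monthlyOpensPipeline(A, start, end)).
function monthlyOpensPipeline(appId, monthStart, monthEnd) {
  return [
    // one document per app per day, so this matches ~30 docs for a month
    { $match: { app_id: appId, date: { $gte: monthStart, $lt: monthEnd } } },
    // collapse them into a single total
    { $group: { _id: null, opens: { $sum: "$total_opens" } } }
  ];
}
```

Because there is only one document per day, a monthly chart touches ~30 documents instead of millions of raw open logs.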
Fewer documents with pre-aggregation iteration 2
• What about looking at different dimensions?
• App opens by device type (e.g., how do iPads
compare to iPhones?)
• Demographics (gender, age group)
Solution!
FLEXIBLE SCHEMAS!
Fewer documents with pre-aggregation iteration 3
{
  app_id: App identifier,
  date: Date of the document,
  totals: {
    app_opens: Total number of opens this day,
    devices: {
      "iPad Air": Total number of opens on the iPad Air,
      "iPhone 4": Total number of opens on the iPhone 4
    },
    genders: {
      male: Total number of opens from male users,
      female: Total number of opens from female users
    },
    ...
  },
  0: {
    app_opens: Number of opens at midnight,
    devices: {
      "iPad Air": Number of opens on the iPad Air at midnight,
      "iPhone 4": Number of opens on the iPhone 4 at midnight
    },
    ...
  },
  ...
}

db.app_opens.update({date: D, app_id: A}, {$inc: {"totals.app_opens": 1, "0.app_opens": 1}})
Dynamically add dimensions in the document
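A sketch of what "dynamically add dimensions" could look like on the client (our own illustration, not Appboy's code): every dimension value becomes a dynamic dot-notation key in the $inc document, so new device models or demographic values need no schema change.

```javascript
// Build the $inc document for the iteration-3 schema from an event's
// attributes. `hour` is the 0-23 hour of the open; `dims` maps a
// dimension name to this event's value, e.g. {devices: "iPad Air"}.
function openIncrements(hour, dims) {
  const inc = { "totals.app_opens": 1, [hour + ".app_opens"]: 1 };
  for (const [dim, value] of Object.entries(dims)) {
    inc[`totals.${dim}.${value}`] = 1;   // e.g. "totals.devices.iPad Air"
    inc[`${hour}.${dim}.${value}`] = 1;  // e.g. "13.devices.iPad Air"
  }
  return inc;
  // used as: db.app_opens.update({date: D, app_id: A}, {$inc: inc})
}
```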
Pre-aggregated analytics
• Pros
• Easily extensible to add other dimensions
• Still only using one document, therefore you can create
charts very quickly
• You get breakdowns over a time period for free
• Cons
• Pre-aggregated data has no attribution
• Have to know questions ahead of time
Follow up: What if we wanted to look at a graph by age group?
Pre-aggregated analytics summary
• Get started tracking time series data quickly
• You get breakdowns for free
• Adding dimensions is super simple
• No attribution, need to know questions ahead of time
• Don’t just rely on pre-aggregated analytics
Counting quickly:
USER SEGMENTATION & STATISTICAL ANALYSIS
User Segmentation
• A group of users who match some set of filters
Counting quickly
Appboy shows you segment membership in real-time as you add/edit/remove filters. !
How do we do it quickly? !
We estimate the population sizes of segments when using our web UI.
Counting quickly
Goal: Quickly get the count() of an arbitrary query !
Problem: MongoDB counts are slow, especially unindexed ones
Counting quickly
10 million documents that represent people:

{
  favorite_color: "blue",
  age: 27,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11,
  attractiveness: 10,
  ...
}
Counting quickly
• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
Big Question: How do you estimate counts?
Answer: The same way news networks do it.

With confidence.
Counting quickly
Add a random number in a known range to each document. Say, between 0 and 9999.

{
  random: 4583,
  favorite_color: "blue",
  age: 27,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11,
  attractiveness: 10,
  ...
}
Add an index on the random number:

db.users.ensureIndex({random: 1})
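A minimal sketch of the setup step (our assumption of how the field is populated; the helper name is ours): stamp each user document with a uniform random integer in [0, 9999] at insert time.

```javascript
// Attach a random "bucket" number to a user document before insert.
// With 10,000 possible values, each bucket ends up with roughly
// population / 10,000 users.
function withRandomBucket(userDoc) {
  return Object.assign({ random: Math.floor(Math.random() * 10000) }, userDoc);
}
// db.users.insert(withRandomBucket({favorite_color: "blue", age: 27, ...}))
// db.users.ensureIndex({random: 1})
```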
Counting quickly
Step 1: Get a random sample

I have 10 million documents. Of my 10,000 random "buckets", I should expect each "bucket" to hold about 1,000 users.

E.g.,

db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting quickly
Step 1: Get a random sample

Let's take a random 100,000 users. Grab a random range that "holds" those users. These all work:

db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [
  {random: {$gt: 9955}},
  {random: {$lt: 56}}
]})
Tip: Limit $maxScan to 100,000 just to be safe
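Picking that range, including the wrap-around case the last query on the slide handles with $or, can be sketched as follows (the function name and the $gte/$lt convention are our assumptions):

```javascript
// Build a query selecting a random window of `width` buckets out of
// the 10,000 possible `random` values, wrapping past 9999 with $or.
function randomBucketQuery(width) {
  const start = Math.floor(Math.random() * 10000);
  const end = start + width;
  if (end <= 10000) {
    return { random: { $gte: start, $lt: end } };
  }
  // window runs off the end of the range: take the tail plus the head
  return { $or: [
    { random: { $gte: start } },
    { random: { $lt: end - 10000 } }
  ] };
}
// e.g. db.users.find(randomBucketQuery(100)) samples ~100,000 of 10M users
```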
Counting quickly

Step 2: Learn about that random sample

db.users.find(
  {
    random: {$gt: 0, $lt: 101},
    gender: "M",
    favorite_color: "blue",
    shoe_size: {$gt: 10}
  }
)
._addSpecial("$maxScan", 100000)
.explain()
Explain result:

{
  nscannedObjects: 100000,
  n: 11302,
  ...
}
Counting quickly
Step 3: Do the math

Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with 95% confidence
Counting quickly
Step 4: Optimize

• Limit $maxScan to (100,000 / numShards) to be even faster
• Cache the random range for a few hours
• Add more RAM (or shards)
• Cache results to not hit the database for the same query
Counting quickly
Step 5: Improve

• Get more than one count: use the aggregation framework on top of the population's sample size
• Work around all sorts of Mongo bugs :-(
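A sketch of the "more than one count" idea (our assumption of the shape, not Appboy's pipeline): one aggregation over the same random sample can return several segment counts at once, instead of running one find().count() per filter.

```javascript
// Build a pipeline that scans one random window of the sample and
// computes several conditional counts in a single pass. You would run
// db.users.aggregate(sampleBreakdownPipeline(0, 101)) and then
// extrapolate each count exactly as in step 3.
function sampleBreakdownPipeline(lo, hi) {
  return [
    { $match: { random: { $gt: lo, $lt: hi } } },
    { $group: {
        _id: null,
        sampled: { $sum: 1 },   // sample size actually scanned
        men:     { $sum: { $cond: [{ $eq: ["$gender", "M"] }, 1, 0] } },
        inNyc:   { $sum: { $cond: [{ $eq: ["$city", "NYC"] }, 1, 0] } }
    } }
  ];
}
```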
Summary
• Pre-aggregated analytics
• Create a document that represents event occurrences
in some time period
• Takes full advantage of MongoDB’s flexible schemas
• Not a catch-all for analytics, you should still store event
data
Summary
• Counting quickly
• Estimate results of arbitrary queries using population
sample sizes
• Depending on your app, this could be a great way to
keep response time predictable as you scale