Fast Queries on Large Datasets Using MongoDB and Summary Documents

For the past few months, we at Signal have been spending a considerable amount of time and effort enhancing the performance of our application for some of our larger customers. It wasn’t that long ago that our largest subscription list was only 80,000 subscribers. We now have many customers with lists topping a million subscribers, with our largest currently sitting at 8.5 million. That’s quite a jump in size, and not one that can generally be made without making a few tweaks to the application. With this jump, certain areas of our application began to slow down considerably. One such area was subscription list reporting. Many of the reports were backed by SQL queries that were becoming increasingly expensive to run against the ever-growing tables.

To address this issue, we decided to create daily summaries of the subscription list data, and report off of the summary data instead of the raw data. The vast majority of our list reporting is already broken out by day, so this seemed logical. And, we decided to use MongoDB to store the summary data.

Why MongoDB?

We are already running MySQL and CouchDB in production. Why not use one of those? Why introduce a third database product to the architecture?

The summary data is live data, not archived data (like the data we are storing in CouchDB). As subscriptions are created/destroyed for the current day, we need to increment/decrement the appropriate metrics. There are also some use cases where we need to alter summary documents for days in the past. This means we need support for atomic operations in the underlying database, to prevent race conditions from skewing the stats. CouchDB does not have the ability to atomically update a value in a document with a single call, so that took it out of the running. Some of our more active lists see several new subscriptions a second. Dealing with document update conflicts would have been a nightmare.

We decided to go with MongoDB over MySQL because of the structure of the summary data. We had several metrics that we wanted to keep tabs on (opt-ins per day, opt-outs per day, etc.). Some of these metrics had a nested nature to them. For example, in addition to keeping track of the number of opt-outs per day, we also wanted to keep track of the reasons those subscriptions were canceled. Also, there are several ways that a user can opt in to a subscription list. For example, each list can have several different keywords for entry via SMS. We needed a way of breaking all of these metrics out by method of entry.

Also, the main data structure containing the stats appears twice within the document: once for the current day’s stats, and once for the totals for that subscription list up to the date in the document. This “to_date” data structure keeps us from having to evaluate ALL of a list’s documents in order to determine how many subscribers there are on a given date. With the “to_date” data structure, the specific day’s document is all we need.

We decided that this data structure was better represented as a single JSON document instead of a series of tables in MySQL. It seemed much cleaner to have a single document per list, per day that contained all of the information for that list’s daily activity than to have it scattered about in a series of relational tables. Choosing MongoDB meant giving up on SQL’s aggregate functions (AVG(), MAX(), MIN(), SUM(), etc), but the benefits provided by the simplicity of the data structure seemed to make up for the loss of these functions.

An example summary document

{
  "campaign_id": 1,
  "subscription_campaign_id": 2,
  "account_id": 1,
  "date": ISODate("2010-10-13T00:00:00Z"),
  "stats": {
    "sms_101": {
      "confirmed_opt_ins": 100,
      "unconfirmed_opt_ins": 15,
      "unconfirmed_opt_outs": 10,
      "unconfirmed_opt_out_reasons": { "UI": 5, "CNF": 5 },
      "confirmed_opt_outs": 20,
      "confirmed_opt_out_reasons": { "UI": 10, "CNF": 10}
    }
    "sms_102": {
      "confirmed_opt_ins": 200,
      "unconfirmed_opt_ins": 30,
      "unconfirmed_opt_outs": 20,
      "unconfirmed_opt_out_reasons": { "UI": 5, "CNF": 15 },
      "confirmed_opt_outs": 35,
      "confirmed_opt_out_reasons": { "UI": 15, "CNF": 20 },
    }
    "email": {
      "confirmed_opt_ins": 300,
      "unconfirmed_opt_ins": 35,
      "unconfirmed_opt_outs": 25,
      "unconfirmed_opt_out_reasons": { "UI": 20, "BULK": 5 },
      "confirmed_opt_outs": 30,
      "confirmed_opt_out_reasons": { "UI": 20, "BULK": 10 }
    }
  },
  "to_date": {
    "sms_101": {
      "confirmed_opt_ins": 1100,
      "unconfirmed_opt_ins": 1000,
      "unconfirmed_opt_outs": 200,
      "unconfirmed_opt_out_reasons": { "UI": 100, "CNF": 100 },
      "confirmed_opt_outs": 400,
      "confirmed_opt_out_reasons": { "UI": 300, "CNF": 100 }
    },
    "sms_102": {
      "confirmed_opt_ins": 2200,
      "unconfirmed_opt_ins": 200,
      "unconfirmed_opt_outs": 150,
      "unconfirmed_opt_out_reasons": { "UI": 100, "CNF": 50 },
      "confirmed_opt_outs": 400,
      "confirmed_opt_out_reasons": { "UI": 250, "CNF": 150 }
    },
    "email": {
      "confirmed_opt_ins": 2050,
      "unconfirmed_opt_ins": 125,
      "unconfirmed_opt_outs": 75,
      "unconfirmed_opt_out_reasons": { "UI": 75 },
      "confirmed_opt_outs": 750,
      "confirmed_opt_out_reasons": { "UI": 600, "BULK": 150 }
    }
  }
}
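
For example, with documents shaped like the one above, answering “how many confirmed SMS opt-ins did this list have as of October 13th?” is a single document lookup rather than a scan. Here is a minimal sketch using the Ruby driver; it assumes `db` is an open database handle and borrows the collection name from the update example later in the post:

# Fetch the single summary document for the day and read the running total.
doc = db.collection('subscription_statistics').find_one(
  'campaign_id' => 1,
  'date'        => Time.utc(2010, 10, 13))
doc['to_date']['sms_101']['confirmed_opt_ins']  # => 1100 in the example above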

[Figure: a possible relational database schema]

MongoDB’s Atomic Operations

As I mentioned above, MongoDB’s atomic operations were key to us choosing MongoDB for this task.

MongoDB does not support locking or transactions like you would find in a traditional relational database. However, it does support a number of operations that are guaranteed to be atomic on a single document. If your data is designed so that all related data (or at least data that would need to be updated at the same time) is contained within a single document, then the atomic operations supported by MongoDB should be more than enough to handle the majority of use cases.

For this project, we made heavy use of the $inc operation. $inc will atomically increment or decrement a value anywhere in the document by the specified amount. If no field exists in the document with that name, $inc will create it and set its initial value to the amount you incremented it by. In our case, we simply initialize the data structure holding the metrics to an empty JSON hash when each summary document is created. The first time $inc is used to increment or decrement some metric, it inserts the metric into the hash with the proper initial value. Subsequent calls to update the document using the $inc operator then update that value accordingly.

Using the $inc operation also meant we didn’t have to read the document to get the current value of the field in order to increment/decrement its value. We were simply able to increment/decrement the value by making one call to the database, keeping things nice and simple.

Atomically incrementing a document’s confirmed subscriptions count

# Assumes `db` is an open database handle from the Ruby mongo driver
# (e.g. Mongo::Connection.new.db(...)).
collection = db.collection('subscription_statistics')
today = Date.today.to_time
collection.update({'campaign_id' => 1, 'date' => today},
  {"$inc" => {"stats.sms_101.confirmed_opt_ins" => 1}},
  {:safe => true})  # :safe => true waits for the write to be acknowledged
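
Because $inc creates fields that don’t yet exist, the same call works the first time a keyword shows up on a given day. As a sketch, incrementing a hypothetical keyword that has no entry yet in the stats hash simply creates the nested field and sets it to 1:

# 'sms_999' is a hypothetical keyword with no entry in stats yet; $inc
# creates stats.sms_999.confirmed_opt_ins and initializes it to 1.
collection.update({'campaign_id' => 1, 'date' => today},
  {"$inc" => {"stats.sms_999.confirmed_opt_ins" => 1}},
  {:safe => true})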

MapReduce

MongoDB also supports MapReduce, providing the ability to evaluate your data in ways that simply can’t be done using its standard queries. For this project, we needed to support summing the values of specific keys across several documents (to calculate the total opt-ins over a date range, for example). Initially, this sounded like a good fit for MapReduce. However, MapReduce will run the specified map function against each document in the database. The more documents your database has, the longer it will take MapReduce to run. I hoped that since we had built an index for the fields that MapReduce was using to determine if a document should be selected, MongoDB would utilize that index to help find the eligible documents. Unfortunately that was not the case.
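
For reference, a MapReduce version of a date-range sum might look roughly like the sketch below (the JavaScript functions, date range, and output options are illustrative, not the code we ran):

# Sum confirmed SMS opt-ins per campaign across a range of summary documents.
map    = "function() { var s = this.stats.sms_101 || {}; emit(this.campaign_id, s.confirmed_opt_ins || 0); }"
reduce = "function(key, values) { return Array.sum(values); }"
collection.map_reduce(map, reduce,
  :query => {'campaign_id' => 1,
             'date' => {'$gte' => Time.utc(2010, 10, 1), '$lte' => Time.utc(2010, 10, 31)}},
  :out => {:inline => 1}, :raw => true)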

In our case, since we are only dealing with 365 documents for a year’s worth of statistics, it was considerably faster for us to find the documents using MongoDB’s standard queries and sum the data in Ruby code than to use MapReduce to do the same. If we were evaluating ALL of the documents in the database, then MapReduce would have been a much better option. I understand that 10gen is hard at work on making MapReduce faster for MongoDB 2.0. Having a strong MapReduce framework can be a powerful tool for a statistics database. Hopefully we’ll be able to utilize it in the future.
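
Here is a minimal sketch of that find-and-sum approach, using the same collection handle as above (the date range and metric are illustrative; the :fields option keeps the driver from pulling back more of each document than we need):

# Fetch a month of summary documents and sum one metric client-side.
docs = collection.find(
  {'campaign_id' => 1,
   'date' => {'$gte' => Time.utc(2010, 10, 1), '$lte' => Time.utc(2010, 10, 31)}},
  :fields => ['stats'])
total_opt_ins = docs.inject(0) do |sum, doc|
  sum + (doc['stats']['sms_101'] || {}).fetch('confirmed_opt_ins', 0)
end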

The Results

The results were staggering. On our largest list, the overview chart we display showing the current number of subscriptions per day over the last 30 days went from taking 37,726ms to load to just 502ms. And the summary report for the list, which contains a wealth of statistics for the list including opt-ins / opt-outs per day, subscriber count by opt-in keyword, and a series of summary statistics, went from taking 64,836ms to load to just 322ms.

Dramatically reducing the size of the data being evaluated had an equally dramatic effect on the amount of time it took to evaluate that data. And MongoDB’s atomic operations and dynamic queries made this project a blast to work on.

    10 thoughts on “Fast Queries on Large Datasets Using MongoDB and Summary Documents”

    1. “I hoped that since we had built an index for the fields that MapReduce was using to determine if a document should be selected, that MongoDB would utilize that index to help find the eligible documents. Unfortunately that was not the case.”

      If you use the query option to the mapreduce command, it will be able to use the index to filter the objects to map over. However, if you do the selection in your map function with an if statement, then that can’t use an index. Either way, for 365 smallish docs it will still probably be faster to process client-side anyway, especially if you make good use of the second argument to find.

    2. Thanks for the article, which I came across in searching for usages of summary tables in mongo. A few questions:

      1) Since you’re summarizing daily, does that mean you just pick a timezone and stick with it forever? In the code sample it looks like you’re using UTC, but you’re based in the US. Is it an issue to people looking at the reports that the “days” are 6-8 hours off of what they might be expecting? (This is something I’m trying to plan around).

      2) Maybe I’ve misread the conclusion, but if it takes on the order of half a second to run a report that’s querying 30 rows, that still seems about an order of magnitude too slow, no? At least, coming from a relational background, I would expect this to be a tens-of-milliseconds, not hundreds-of-milliseconds, operation. Do you think this is because you wound up having to do the summations post-facto, in Ruby?

      Thanks again for sharing the article.

    3. Hi Tom,

      Thanks for the comments.

      1) The document represents a day in the user’s time zone. The time zone in the “date” field is not used. So, the user is stuck with this timezone, but not forever. The summary documents summarize data that is stored in our MySQL database. If necessary, we can always rebuild them for a user based on a new time zone.

      2) The 502ms is for the full HTTP request, and not just the database query. I don’t have the breakdown, but I don’t think MongoDB was a big part of that time. I think much of it was spent in Ruby/Rails code (we fetch the documents from MongoDB, and then sum the data using Ruby code), and the HTTP request in general. If I’m able to track down the breakdown of that request, I’ll post it here.

    4. Hi John.

      Interesting article!
      I can see you have extensive knowledge of MongoDB :). I’m currently working
      on deploying MongoDB for a system, but I have more than 5 million documents stored
      in a collection. How should I deal with reporting and statistics?

      Do you think it’s good idea to adapt MapReduce to accomplish it?
      I’m afraid it would take too long to load, because it’s an “almost” real-time reporting system.

      Thank you again John! :)

    5. Hi Ezequiel,

      That’s a lot of documents :) For the project I described in this blog post, we used daily summary documents (taking advantage of the document’s schema-less nature) to avoid having to deal with millions of individual, detailed docs. So, we’re dealing with hundreds of documents at a time, not millions.

      MongoDB 2.2, which was just released, sports a new aggregation framework (http://docs.mongodb.org/manual/aggregation/). You may want to take a look at this to see if it supports what you will want to do, reporting wise. It seems much more powerful than the map/reduce in earlier versions of MongoDB, and includes a few features aimed at dealing with large data sets (early filtering, sharded operation, etc). However, this framework does have limitations that are clearly documented. So, you’ll certainly want to create a prototype with a large data set to make sure it can do what you want it to do.

      In my opinion, you can’t beat a relational database on a big server with a lot of RAM for reporting.

    6. By the time I left Signal, MongoDB’s aggregation framework had not yet been released, so I’m not sure if this reporting was ever ported to it.
