Introducing Proby – Task monitoring made simple!
Note: This entry has been cross-posted from the Signal company blog.
One Monday morning about a month ago, I was browsing through open issues in our bug tracker looking for something to work on. It was my week on “technical support”. At Signal, the engineer on technical support spends a week working on non-customer-facing issues with our infrastructure, process, etc., that are too large to simply knock out as you encounter them.
An issue entitled “Add monitoring to detect jobs that do not start” caught my attention. We already had monitoring in place to alert us when a job fails. However, we had nothing to alert us if that job never starts in the first place. We have many scheduled jobs that run in the background, some pretty important, and have been bitten on more than one occasion by recurring jobs that had not run for quite some time.
I bounced some ideas off the team, and we all agreed that it made sense to create a new application to monitor our scheduled tasks. The app would know what is supposed to run, and when. We would change all of our jobs to “ping” the app when they start and finish. If the app detects that something did not run when it was supposed to, or did not finish when it was supposed to, it could alert the team.
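To make the idea concrete, here is a minimal in-memory sketch of that start/finish “ping” scheme. It is not Proby's actual code or API: the `TaskMonitor` class, its method names, and the task names are all hypothetical, and a real monitor would also need persistence, recurring schedules, and some grace period before raising an alarm.

```ruby
# Hypothetical in-memory monitor; an illustration of the idea, not Proby's code.
class TaskMonitor
  Task = Struct.new(:name, :expected_start, :max_run_seconds, :started_at, :finished_at)

  def initialize
    @tasks = {}
  end

  # Register a task, the time it is expected to start,
  # and how long it is allowed to run.
  def expect(name, expected_start, max_run_seconds)
    @tasks[name] = Task.new(name, expected_start, max_run_seconds)
  end

  # Jobs "ping" the monitor when they start and when they finish.
  def ping_start(name, at = Time.now)
    @tasks.fetch(name).started_at = at
  end

  def ping_finish(name, at = Time.now)
    @tasks.fetch(name).finished_at = at
  end

  # Returns [task_name, problem] pairs for every task that looks wrong at `now`.
  def alarms(now = Time.now)
    @tasks.values.map do |task|
      if task.started_at.nil?
        [task.name, :did_not_start] if now > task.expected_start
      elsif task.finished_at.nil?
        [task.name, :did_not_finish] if now > task.started_at + task.max_run_seconds
      end
    end.compact
  end
end

noon = Time.utc(2011, 12, 1, 12, 0, 0)
monitor = TaskMonitor.new
monitor.expect("nightly_report", noon, 600) # never pings at all
monitor.expect("cleanup", noon, 60)         # starts, but never finishes
monitor.ping_start("cleanup", noon)

late_alarms = monitor.alarms(noon + 3600)
```

An hour after the expected start, `late_alarms` flags `nightly_report` as never having started and `cleanup` as never having finished, which are exactly the two cases a failure-only monitor would miss.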
And just like that, Proby was born.
“What’s with the name?”, you ask? I likened the application to a probation officer, keeping an eye on wayward tasks.
Since going live, Proby has caught several issues with our scheduled jobs that would have gone unnoticed…many more than I would have anticipated. It has also evolved into a fairly complete application. It is easy to set up new tasks. It holds onto the execution history of each task, allowing you to see a trend of the run time for each task, and which runs resulted in an alarm. It supports alarms via email or SMS, using Signal’s SMS messaging API. It has several settings that allow you to tweak exactly when an alarm is sent, cutting down on the number of false alarms. And, thanks to designer extraordinaire Drew Myler, it looks amazing!
Shortly after creating Proby, we realized that there is no way we are the only ones who have had scheduled tasks go days without running, unnoticed. So today, we are making Proby available to the general public. It is currently in closed beta, and we are letting people in slowly to flush out any bugs or other issues. Use of Proby is also FREE while it is in beta.
Interested? Head on over to http://probyapp.com/signup and sign up to participate in the closed beta! And, don’t forget to follow @probyapp on Twitter for updates!
What I Learned by Attending a Code Retreat
On Tuesday, July 26th, I attended my first code retreat. The code retreat was led by Corey Haines and Tyler Jennings, sponsored by Obtiva, and held to coincide with Tech Week. Simply put, it was an invaluable experience, and I can’t wait for the chance to do it again.
What the hell is a code retreat?
For those who have never attended a code retreat, the format is simple. The group is given a problem. In our case the problem was Conway’s Game of Life. The code retreat is then broken up into several 45 minute sessions. Each session you pair with somebody new, and you take a stab at solving some portion of the problem. All code is written using TDD. At the end of each session you delete all of the code you created. The problem, and the time limit on the sessions, are specifically chosen so that there is little to no chance that you will come up with a complete solution in the allowed time. Solving the problem is not the point. The point is to practice.
We as software developers don’t give ourselves enough time to practice. At work, or even with open source or side projects, we’re always trying to get something done. Rarely, if at all, do we practice solely for the sake of practice. This is what we focused on at the code retreat. At the end of each 45 minute session, we deleted our code, and discussed what we learned. We were encouraged to try new ideas, to really push ourselves outside of our comfort zone. Knowing that the code would be deleted in a matter of minutes, there was no need to worry about screwing up the design, or creating something that would ultimately become unmaintainable. We had a safety net, and were encouraged to use it.
Corey said that we never get the chance to write “perfect code”. Perfect code doesn’t exist in the real world. In the real world products need to ship. It is simply not practical to spend the amount of time on a piece of code that is required to make it “perfect”. There are always trade offs of one form or another. However, code retreats give you the opportunity to practice writing perfect code. There is nothing to ship, nothing to “get done”. The code lives for 45 minutes, and then it disappears, giving you a unique opportunity to practice writing tiny pieces of perfect code.
Structured chaos
Sounds pretty hectic, right? But, the sessions were far from unstructured. Throughout the retreat, we were encouraged to keep in mind the four rules of simple design:
- Runs all the tests
- No duplication
- Proper naming (expresses ideas and reveals intent)
- Small (small methods, small classes)
The morning sessions were primarily focused on getting familiar with the problem. As the day progressed, we were encouraged by Tyler and Corey to try out certain ideas, or to think about tackling old problems in new ways. Can you solve the issue using polymorphism instead of if statements? Can you do it by capping all of your methods to 3 lines? What happens when you focus on proper naming of your tests and your methods? These “challenges” were designed to push us out of our comfort zone. It worked.
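As an illustration of the polymorphism challenge, here is one way it might look for the Game of Life rules. This is not code from the retreat (all of that was deleted); the `LiveCell` and `DeadCell` class names are my own, and a check on the neighbor count still remains inside each class. What goes away is the branch on whether the cell itself is alive.

```ruby
# Each cell state knows its own transition rule, so callers never
# need an `if cell.alive?` branch to decide which rule applies.
class LiveCell
  # A live cell survives with two or three live neighbors.
  def next_state(live_neighbors)
    [2, 3].include?(live_neighbors) ? self : DeadCell.new
  end

  def alive?
    true
  end
end

class DeadCell
  # A dead cell comes to life with exactly three live neighbors.
  def next_state(live_neighbors)
    live_neighbors == 3 ? LiveCell.new : self
  end

  def alive?
    false
  end
end

# The caller simply asks whatever cell it has for its next state.
cells = [LiveCell.new, DeadCell.new]
next_generation = cells.map { |cell| cell.next_state(3) }
```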
Some thoughts
I quickly realized that 45 minutes was not enough time to solve the problem. However, it took me almost the entire day to finally get comfortable with the fact that even solving a significant portion of the problem was out of reach. About halfway through the day, I caught myself trying to solve a chunk of the problem that was way too large for the amount of time we were given. This resulted in me rushing a solution in an attempt to get a test with far too large a scope to pass. At the end of the session, I was left confused with a pile of crappy code (unnecessary abstractions, no encapsulation, etc.), and I wasn’t sure why we had written at least half of it.
I love that feeling you get when you create something, or solve a problem. This is why I am a programmer. Not being able to finish the problem, or even a significant portion of it, was incredibly frustrating. It took me the entire day to finally let go of that.
It was also great to see the different ways people were approaching the problem. At the start of the code retreat most pairs were pursuing a similar approach. But by the end of the day, the variety of approaches was simply amazing. With each session, everybody learned something new. Since we were required to find a new partner for each session, the combined experience of each new team seemed to produce a new approach, slightly different from the approaches each of the partners had tried in previous sessions. It was almost like looking at an old tree. Each session resulted in the forking of each branch on the tree. By the end of the day, there were a whole lot of branches.
Practice, practice, practice…
At the end of the code retreat, we all briefly shared with the group what we learned that day. Everybody had something new to say. The code retreat had a way of showing all of us what we needed to work on and what we needed to explore.
When writing this blog post, it suddenly hit me.
This is exactly what practice is designed to do!
When I practice something else (karate basics or kata for example), it has the same effect. Serious practice makes it very clear what needs to be developed. It makes no difference what it is that you’re practicing. The key to getting better at something is to practice, and to continue developing the “problem areas” identified by your practice. But it can’t be just any practice. As Corey stated during the code retreat:
Practice doesn’t make perfect. Perfect practice makes perfect.
Perhaps the single biggest thing I took away from the code retreat is that it showed me how to practice writing code.
What I learned
First off, I learned that I need to slow down…way down. All too often I think I know the solution to a problem before my fingers even touch the keyboard. Perhaps worse, I am so focused on implementing that solution that I often ignore warning signs (code smells) that should be alerting me that my solution may not be the best one. This code retreat made me realize that I need to do a much better job of listening to my tests, and listening to my code. I need to do a better job of keeping my nose open at all times for code smells, and to stop when I smell something. Something that doesn’t feel right in the tests is a sure sign that something is wrong with the design. Stop, take a step back, think for a minute about what is causing the smell, and fix it.
There is nothing wrong with having a solution in mind before you start coding. In fact, I think that most times you want to have a solution in mind before your fingers touch the keyboard. However, you should keep an open mind as you start to build your solution. Don’t assume your initial solution is the best solution. In fact, it’s probably safer to assume that it is not, and allow your tests to guide you to a better one.
Go to a code retreat!
If you’re serious about getting better as a programmer, and have never been to a code retreat, you’re missing out on an incredible learning experience. Make it a point to go to one. Since code retreats are essentially group coding practice, you can attend them over and over and be sure to walk away with something new every time. I’m already looking forward to the next one.
I have so much to learn.
Fast Queries on Large Datasets Using MongoDB and Summary Documents
The past few months we at Signal have been spending a considerable amount of time and effort enhancing the performance of our application for some of our larger customers. It wasn’t that long ago that our largest subscription list was only 80,000 subscribers. We now have many customers with lists topping a million subscribers, with our largest currently sitting at 8.5 million. That’s quite a jump in size, and not one that can generally be made without making a few tweaks to the application. With this jump, certain areas of our application began to slow down considerably. One such area was subscription list reporting. Many of the reports were backed by SQL queries that were becoming increasingly expensive to run against the ever growing tables.
To address this issue, we decided to create daily summaries of the subscription list data, and report off of the summary data instead of the raw data. The vast majority of our list reporting is already broken out by day, so this seemed logical. And, we decided to use MongoDB to store the summary data.
Why MongoDB?
We are already running MySQL and CouchDB in production. Why not use one of those? Why introduce a 3rd database product to the architecture?
The summary data is live data, not archived data (like the data we are storing in CouchDB). As subscriptions are created/destroyed for the current day, we need to increment/decrement the appropriate metrics. There are also some use cases where we need to alter summary documents for days in the past. This means we need support for atomic operations in the underlying database, to prevent race conditions from skewing the stats. CouchDB does not have the ability to atomically update a value in a document with a single call, so that took it out of the running. Some of our more active lists see several new subscriptions a second. Dealing with document update conflicts would have been a nightmare.
We decided to go with MongoDB over MySQL because of the structure of the summary data. We had several metrics that we wanted to keep tabs on (opt ins per day, opt outs per day, etc). Some of these metrics had a nested nature to them. For example, in addition to keeping track of the number of opt-outs per day, we also want to keep track of the reasons those subscriptions were canceled. Also, there are several ways that a user can opt-in to a subscription list. For example, each list can have several different keywords for entry via SMS. We needed a way of breaking all of these metrics out by method of entry.
Also, the main data structure containing the stats is repeated twice within the document. Once for the current day’s stats, and once for the totals for that subscription list up until the date in the document. This “to_date” data structure keeps us from having to evaluate ALL of a list’s documents in order to determine how many subscribers there are on a given date. With the “to_date” data structure, the specific day’s document is all we need.
We decided that this data structure was better represented as a single JSON document instead of a series of tables in MySQL. It seemed much cleaner to have a single document per list, per day that contained all of the information for that list’s daily activity than to have it scattered about in a series of relational tables. Choosing MongoDB meant giving up on SQL’s aggregate functions (AVG(), MAX(), MIN(), SUM(), etc), but the benefits provided by the simplicity of the data structure seemed to make up for the loss of these functions.
An example summary document
{
  "campaign_id": 1,
  "subscription_campaign_id": 2,
  "account_id": 1,
  "date": ISODate("2010-10-13T00:00:00Z"),
  "stats": {
    "sms_101": {
      "confirmed_opt_ins": 100,
      "unconfirmed_opt_ins": 15,
      "unconfirmed_opt_outs": 10,
      "unconfirmed_opt_out_reasons": { "UI": 5, "CNF": 5 },
      "confirmed_opt_outs": 20,
      "confirmed_opt_out_reasons": { "UI": 10, "CNF": 10 }
    },
    "sms_102": {
      "confirmed_opt_ins": 200,
      "unconfirmed_opt_ins": 30,
      "unconfirmed_opt_outs": 20,
      "unconfirmed_opt_out_reasons": { "UI": 5, "CNF": 15 },
      "confirmed_opt_outs": 35,
      "confirmed_opt_out_reasons": { "UI": 15, "CNF": 20 }
    },
    "email": {
      "confirmed_opt_ins": 300,
      "unconfirmed_opt_ins": 35,
      "unconfirmed_opt_outs": 25,
      "unconfirmed_opt_out_reasons": { "UI": 20, "BULK": 5 },
      "confirmed_opt_outs": 30,
      "confirmed_opt_out_reasons": { "UI": 20, "BULK": 10 }
    }
  },
  "to_date": {
    "sms_101": {
      "confirmed_opt_ins": 1100,
      "unconfirmed_opt_ins": 1000,
      "unconfirmed_opt_outs": 200,
      "unconfirmed_opt_out_reasons": { "UI": 100, "CNF": 100 },
      "confirmed_opt_outs": 400,
      "confirmed_opt_out_reasons": { "UI": 300, "CNF": 100 }
    },
    "sms_102": {
      "confirmed_opt_ins": 2200,
      "unconfirmed_opt_ins": 200,
      "unconfirmed_opt_outs": 150,
      "unconfirmed_opt_out_reasons": { "UI": 100, "CNF": 50 },
      "confirmed_opt_outs": 400,
      "confirmed_opt_out_reasons": { "UI": 250, "CNF": 150 }
    },
    "email": {
      "confirmed_opt_ins": 2050,
      "unconfirmed_opt_ins": 125,
      "unconfirmed_opt_outs": 75,
      "unconfirmed_opt_out_reasons": { "UI": 75 },
      "confirmed_opt_outs": 750,
      "confirmed_opt_out_reasons": { "UI": 600, "BULK": 150 }
    }
  }
}
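To show why the “to_date” section is handy, here is a rough sketch of reading a subscriber count from a single day's document. The hash below is a trimmed-down stand-in for the example document, and the formula (confirmed opt-ins minus confirmed opt-outs) is my simplifying assumption for illustration; the exact calculation the application uses is not spelled out in this post. The `confirmed_subscribers` helper is hypothetical.

```ruby
# A trimmed-down stand-in for one summary document
# (values copied from the "to_date" section of the example).
summary = {
  "to_date" => {
    "sms_101" => { "confirmed_opt_ins" => 1100, "confirmed_opt_outs" => 400 },
    "sms_102" => { "confirmed_opt_ins" => 2200, "confirmed_opt_outs" => 400 },
    "email"   => { "confirmed_opt_ins" => 2050, "confirmed_opt_outs" => 750 }
  }
}

# Assumed formula: confirmed subscribers = confirmed opt-ins - confirmed opt-outs,
# summed across every method of entry.
def confirmed_subscribers(summary)
  summary["to_date"].values.inject(0) do |total, metrics|
    total + metrics.fetch("confirmed_opt_ins", 0) - metrics.fetch("confirmed_opt_outs", 0)
  end
end

count = confirmed_subscribers(summary) # 700 + 1800 + 1300 = 3800
```

No other document needs to be read to answer the question for that date.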
A possible relational database schema
MongoDB’s Atomic Operations
As I mentioned above, MongoDB’s atomic operations were key to us choosing MongoDB for this task.
MongoDB does not support locking or transactions like you would find in a traditional relational database. However, it does support a number of operations that are guaranteed to be atomic on a single document. If your data is designed so that all related data (or at least data that would need to be updated at the same time) is contained within a single document, then the atomic operations supported by MongoDB should be more than enough to handle the majority of use cases.
For this project, we made heavy use of the $inc operation. $inc will atomically increment or decrement a value anywhere in the document by the specified value. If no property exists in the document with that name, $inc will create it, and set its initial value to the value you wanted to increment it by. For this project, we simply initialized the data structure holding the metrics to an empty JSON hash when the summary document is created. The first time $inc is used to increment or decrement some metric, it will insert the metric into the hash, along with the proper initial value. Subsequent calls to update the document using the $inc operator will then update that value accordingly.
Using the $inc operation also meant we didn’t have to read the document to get the current value of the field in order to increment/decrement its value. We were simply able to increment/decrement the value by making one call to the database, keeping things nice and simple.
Atomically incrementing a document’s confirmed subscriptions count
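require 'date'

# Assumes `db` is an already-open database handle from the Ruby MongoDB driver.
collection = db.collection('subscription_statistics')
today = Date.today.to_time
# Increment today's confirmed opt-in count for the sms_101 keyword in a single call.
collection.update({'campaign_id' => 1, 'date' => today},
                  {"$inc" => {"stats.sms_101.confirmed_opt_ins" => 1}},
                  {:safe => true})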
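If the “create it if it is missing” behavior of $inc is hard to picture, the following in-memory imitation may help. It is only a model of the semantics described above, written against a plain Ruby Hash; it is not the MongoDB driver, it is not atomic, and the `simulate_inc` helper is a name I made up for this sketch.

```ruby
# In-memory imitation of $inc semantics on a nested Hash.
# Illustration only: not the MongoDB driver, and not atomic.
def simulate_inc(document, dotted_path, amount)
  *parents, field = dotted_path.split(".")
  # Walk (and create, if needed) the nested hashes along the path.
  target = parents.inject(document) { |node, key| node[key] ||= {} }
  # A missing field starts out at the increment amount; otherwise add to it.
  target[field] = target.fetch(field, 0) + amount
  document
end

# The metrics hash starts out empty, just like a freshly created summary document.
doc = { "stats" => {} }
simulate_inc(doc, "stats.sms_101.confirmed_opt_ins", 1)   # creates the metric
simulate_inc(doc, "stats.sms_101.confirmed_opt_ins", 1)   # increments it
simulate_inc(doc, "stats.sms_101.confirmed_opt_outs", -1) # a negative amount decrements
```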
MapReduce
MongoDB also supports MapReduce, providing you the ability to evaluate your data in ways that simply can’t be done using its standard queries. For this project, we needed to support summing the values of specific keys across several documents (to calculate the total opt-ins over a date range, for example). Initially, this sounded like a good fit for MapReduce. However, MapReduce will run the specified map function against each document in the database. The more documents your database has, the longer it will take MapReduce to run. I hoped that, since we had built an index for the fields that MapReduce was using to determine if a document should be selected, MongoDB would utilize that index to help find the eligible documents. Unfortunately that was not the case.
In our case, since we are only dealing with 365 documents for a year’s worth of statistics, it was considerably faster for us to find the documents using MongoDB’s standard queries and sum the data in ruby code, than to use MapReduce to do the same. If we were evaluating ALL of the documents in the database, then MapReduce would have been a much better option. I understand that 10gen is hard at work on making MapReduce faster for MongoDB 2.0. Having a strong MapReduce framework can be a powerful tool for a statistics database. Hopefully we’ll be able to utilize it in the future.
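Summing in Ruby can be as simple as the sketch below. The `documents` array stands in for whatever a date-range query returned (the query itself is omitted), and the `total_for` helper is a hypothetical name, not code from our application.

```ruby
# Stand-ins for summary documents already fetched by a date-range query.
documents = [
  { "stats" => { "sms_101" => { "confirmed_opt_ins" => 100 },
                 "email"   => { "confirmed_opt_ins" => 300 } } },
  { "stats" => { "sms_101" => { "confirmed_opt_ins" => 5 } } },
  { "stats" => {} } # a day with no activity
]

# Sum one metric across every method of entry in every document.
def total_for(documents, metric)
  documents.inject(0) do |sum, doc|
    sum + doc["stats"].values.inject(0) { |subtotal, metrics| subtotal + metrics.fetch(metric, 0) }
  end
end

total = total_for(documents, "confirmed_opt_ins") # 100 + 300 + 5 = 405
```

With at most a few hundred small documents per report, a loop like this is cheap compared to the query that fetches them.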
The Results
The results were staggering. On our largest list, the overview chart we display showing the current number of subscriptions per day over the last 30 days went from taking 37726ms to load to just 502ms. And the summary report for the list, which contains a wealth of statistics for the list including opt-ins / opt-outs per day, subscriber count by opt in keyword, and a series of summary statistics, went from taking 64836ms to load to just 322ms.
Dramatically reducing the size of the data being evaluated had an equally dramatic effect on the amount of time it took to evaluate that data. And, MongoDB’s atomic operations and dynamic queries made this project a blast to work on.
Optional method parameters in Ruby
One of the things I love about Ruby is the flexibility it provides when it comes to method parameters. It’s dead simple to provide default values for one or more parameters, to handle variable length argument lists, and to pass blocks of code into a method. But perhaps my favorite is the ability to tack hash key/value pairs onto the end of a method call, and have those options combined into a Hash on the other side.
def some_method(required_1, required_2, options={})
  # Do something awesome!
end
some_method("foo", "bar")
some_method("foo", "bar", :option_1 => false, :option_2 => true)
some_method("foo",
            "bar",
            :option_1 => false,
            :option_2 => true,
            :option_3 => "something",
            :option_4 => "something else")
This may not look like much. However, this feature alone is capable of producing some very readable code, and is used extensively in APIs throughout the Ruby ecosystem. Consider for a moment what these APIs would look like if Ruby did not have this capability, which isn’t hard to imagine for those of us with a background in a language like Java. You would either be forced to require that each parameter be specified:
# What is this code doing? What do the nil values,
# or even the true and false values map to?
some_method("foo", "bar", false, nil, true, nil)
or accept a hash or a request object that contains all of the necessary parameters:
# This is much more readable, but requires that the
# options hash be created on its own line.
options = {:option_1 => true, :option_2 => false}
some_method("foo", "bar", options)
Providing optional parameters via hash key/value pairs at the end of a method call produces code that is incredibly readable. You have the names of the attributes right next to their corresponding values! There is no ambiguity whatsoever as to which values match up with which parameters.
It is also very flexible. The order of the attributes in the hash does not matter, like it does for required attributes. And, it is very easy to add new options, or delete old ones.
This approach also makes it easy to specify default values for options that were not specified when calling the method:
def some_method(required_1, required_2, options={})
  defaults = {
    :option_1 => "option 1 default",
    :option_2 => "option 2 default",
    :option_3 => "option 3 default",
    :option_4 => "option 4 default"
  }
  options = defaults.merge(options)
  # Do something awesome!
end
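To see the merge in action, here is a small runnable variation. The `greeting` method and its `:punctuation` and `:shout` options are made up purely for this illustration.

```ruby
# Hypothetical method: caller-supplied options override the defaults,
# and anything the caller leaves out falls back to its default value.
def greeting(name, options={})
  defaults = { :punctuation => "!", :shout => false }
  options = defaults.merge(options)

  text = "Hello, #{name}#{options[:punctuation]}"
  options[:shout] ? text.upcase : text
end

plain  = greeting("world")                                        # "Hello, world!"
loud   = greeting("world", :shout => true)                        # "HELLO, WORLD!"
custom = greeting("world", :punctuation => "?", :shout => true)   # "HELLO, WORLD?"
```

Because merge gives precedence to the hash passed as its argument, the defaults only fill in the keys the caller did not supply.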
There are however a few minor drawbacks to this approach. The first is documentation. Methods that take a hash of options as a parameter do not convey any information about the valid options in the method definition alone. And, it is possible that the method in question simply forwards the options to another method, sending you on a wild goose chase to determine the set of valid options the code supports.
# Looking for a list of valid option keys...no help here.
def some_method(required_1, required_2, options={})
  do_something_awesome_with_the_options(options)
end
This is why it is so important to document your public API if you are using this approach. Take a look at the ActiveRecord::Associations::ClassMethods documentation. This page documents, in a very clear and easy to read manner, all of the supported options for each method.
It is also worth pointing out that while this approach is great for optional parameters, it is ill suited for required parameters. Required parameters should be specified outside of the options hash, making it clear that values for the required parameters must be provided. While it’s true that stuffing all of your parameters inside a hash means you’ll never have to look at another wrong number of arguments error again, it will make your code difficult to understand, and easy to misuse.
CouchDB Plugins for Scout
Back in December I whipped up a series of CouchDB plugins for the Scout monitoring service. The plugins allow you to track all sorts of metrics for CouchDB, including (but not limited to):
- Mean reads / second
- Mean writes / second
- Mean requests / second for DELETE, GET, HEAD, POST, and PUT requests
- Mean view requests / second
- Mean bulk HTTP requests / second
- Counts for various HTTP response codes
In addition, there is a plugin for individual CouchDB databases and individual couchdb-lucene indexes. The database plugin will report:
- Database size
- Number of documents
- Number of deleted documents
- Number of update operations
The couchdb-lucene plugin will report:
- Size of the index
- Number of documents indexed
- Number of deleted documents
The kind folks over at Scout have just released two new, official plugins based on the ones I created. The CouchDB Overall plugin combines some of the more important CouchDB metrics into a single plugin, and the CouchDB Database plugin reports the same set of the stats as the database plugin listed above.
The original plugins can be found at https://github.com/signal/scout-plugins/tree/master/couchdb. More information can be found here. I hope you find them useful.
Thanks to Doug Barth for some help on the plugins, and Derek over at Scout for putting together the official plugins.