Last week, I was working on a new feature of TextMe that required a call to one of our external service providers for some data. The call in particular was to look up the carrier for a given mobile number. Sounds simple enough. However, we already had code that integrated with this provider in one component of our architecture, and I needed to make this call from another component.
A couple of options jumped out at me. I could pull the code I needed to use into a library that could be shared between the components, or implement some form of inter-process communication that would enable me to invoke the service from the one component, and have it processed by the component that already integrated with the service provider.
Pulling the code into a library would be the easier of the two to implement for sure. As with any project of reasonable size, we were already doing this for several other shared pieces of code. Adding one more to the list would be a piece of cake. The second option would require a bit more work. The component that integrates with the service provider runs as a daemon process, so using something straightforward like HTTP to handle the inter-process communication was out of the question. Instead, I’d likely have to utilize the queuing framework that we already had in place. What makes it more difficult is that the queuing library we use only handles asynchronous calls, and this would need to be a synchronous call. Not the end of the world by any means, but without a doubt more complicated than simply sucking the code into a library.
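The usual trick for getting synchronous behavior out of an asynchronous queue is a request/reply pattern: tag each request with a correlation id and a reply destination, then block until the matching reply comes back. Here's a minimal sketch of that idea. The class and method names are mine, and an in-process `Queue` stands in for a real broker like RabbitMQ, so this illustrates the pattern rather than our actual implementation.

```ruby
require "securerandom"
require "timeout"

# Sketch of a synchronous request/reply cycle over an asynchronous queue.
# Each request carries a correlation id and a reply destination; the
# caller blocks until the matching reply arrives or the timeout expires.
class SyncQueueClient
  def initialize(request_queue)
    @request_queue = request_queue
  end

  def call(payload, timeout: 5)
    correlation_id = SecureRandom.uuid
    reply_queue = Queue.new
    @request_queue << { payload: payload,
                        correlation_id: correlation_id,
                        reply_to: reply_queue }
    Timeout.timeout(timeout) do
      loop do
        reply = reply_queue.pop
        return reply[:payload] if reply[:correlation_id] == correlation_id
      end
    end
  end
end

# A worker thread playing the role of the daemon component that already
# integrates with the provider (the actual carrier lookup is faked here).
requests = Queue.new
worker = Thread.new do
  loop do
    msg = requests.pop
    carrier = "Example Carrier" # stand-in for the real provider call
    msg[:reply_to] << { correlation_id: msg[:correlation_id],
                        payload: carrier }
  end
end

client = SyncQueueClient.new(requests)
result = client.call("+15555550123")
worker.kill
```

The correlation id matters because, with a real broker, replies for several in-flight requests can land on the same reply queue; it lets the caller discard replies that aren't its own.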
Even though option one was easier to implement, having two components in the architecture integrate with a third party seemed like a bad idea. Sprinkling integration points throughout your application is usually a recipe for failure, largely because it is only a matter of time before an integration point fails.
If we went with option one, we could have the library handle the failures. However, even if handled properly, failures like this usually have other consequences. For example, if the service never responded, it could cause requests to back up in the given component. Even if we implemented a timeout, it is likely that the timeout would be greater than the average response time, which means our system would take longer to process each request. If you had to deal with a lot of incoming requests at the time of the failure, you could be in for a world of hurt, especially if you had multiple components suffering from this issue.
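To make the timeout problem concrete, here's a small sketch of guarding a provider call with a timeout. The method names and the time budgets are illustrative, not our real values; the point is that even when the timeout fires correctly, every request that hits it still burns the full timeout before failing, which is what backs requests up under load.

```ruby
require "timeout"

# Illustrative stand-in for a provider that has stopped responding
# within our budget.
def slow_provider_call(number)
  sleep 0.5 # simulate a slow or hung remote service
  "Example Carrier"
end

# Guard the call with a timeout and surface the failure as missing data.
# Note the cost: a timed-out request still takes the full 0.2 seconds,
# likely longer than the provider's normal response time.
def lookup_carrier(number, timeout_seconds: 0.2)
  Timeout.timeout(timeout_seconds) do
    slow_provider_call(number)
  end
rescue Timeout::Error
  nil # caller must decide what to do without carrier data
end

result = lookup_carrier("+15555550123")
```

A `nil` result here just pushes the decision upstream; the caller still has to choose between an error, filler data, or retrying later.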
With option two, we have a bit more control over the situation. First off, we would know there was one, and only one spot in our architecture that integrated with that particular service. This would allow us to better understand the potential impact of the failure, and the steps that needed to be taken to address it. Second, it would allow us to more easily implement a circuit breaker to prevent the failure from rippling across the system. If the circuit breaker was tripped, we could return an error, some sort of filler data, or queue the request up for processing at a later time. Third, we could potentially add resources to account for the situation. Since the work was being done in a completely different component, if it was simply a matter of increased latency on the part of our service provider, we could always spin up a few more instances of that component to account for the fact that some of the requests may be starting to back up.
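The circuit breaker idea can be sketched in a few lines. This is a deliberately minimal version (the class, threshold, and fallback value are all illustrative): after a run of consecutive failures the breaker opens and subsequent calls fail fast with filler data instead of touching the provider at all. A production breaker would also add a half-open state that periodically retries the provider to see if it has recovered.

```ruby
# Minimal circuit breaker sketch. After `failure_threshold` consecutive
# failures the breaker opens, and calls return the fallback immediately
# rather than hitting the failing provider again.
class CircuitBreaker
  def initialize(failure_threshold: 3)
    @failure_threshold = failure_threshold
    @failure_count = 0
  end

  def open?
    @failure_count >= @failure_threshold
  end

  def call(fallback)
    return fallback if open? # fail fast: don't touch the provider
    begin
      result = yield
      @failure_count = 0 # a success closes the breaker again
      result
    rescue StandardError
      @failure_count += 1
      fallback
    end
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2)

# Four attempts against a provider that is down: the first two fail and
# trip the breaker, the last two fail fast without calling the block.
results = 4.times.map do
  breaker.call("unknown") { raise "provider down" }
end
```

The payoff is latency: once the breaker is open, callers get their fallback in microseconds instead of waiting out a timeout on every request.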
In his fantastic book, Release It!, Michael Nygard talks about integration points, along with a host of other topics regarding the deployment and support of production software. Any developer who writes code that will eventually be running in a production environment (which I hope is EVERY developer) should read this book. Regarding integration points, Michael says the following:
- Integration points are the number-one killer of systems.
- Every integration point will eventually fail in some way, and you need to be prepared for that failure.
- Integration point failures take several forms, ranging from various network errors to semantic errors.
- Failure in a remote system quickly becomes your problem, usually as a cascading failure when your code isn’t defensive enough.
However, even though integration points can be tough to work with, systems without any integration points are usually not that useful. So, integration points are a necessary evil. Our best tools to keep them in line are defensive coding, being smart about where you place the integration points in your system, and limiting the number of integration points in the system.
With the help of my colleague Doug Barth, we (mostly Doug) whipped up a synchronous client for the Ruby AMQP library. I then used this code to implement the synchronous queuing behavior I needed to keep the integration point where it belonged. Those interested can find the code on GitHub, at http://github.com/dougbarth/amqp/tree/bg_em.