Been there — a distributed ExecutorService.

2008-09-17

I happened to run across a new open-source project today I had not previously heard about: Hazelcast.

What really piqued my interest was the implementation of a distributed ExecutorService, as I wrote something with similar functionality as a façade over Orbitz’s Jini infrastructure. The design and implementation met three primary objectives:

  • add timeouts to Jini, which unfortunately lacks such a feature
  • bound the number of concurrent requests being processed
  • bound the number of threads created for request processing

The design was elegant, imo, because the exact same code worked either client or service-side — it just mattered which way you twisted your head — and masked the complexities of both Jini and the ExecutorService. The timeouts were managed via the Future and the throttling of requests and threads by configuring the backing Queue and pool size. Spring wiring entirely hid the remote invocation machinery from the caller. In almost every case there were no code changes and the timeout and throttling features could be turned on and tuned entirely through a configuration change.
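To make the timeout and throttling idea concrete, here is a minimal sketch in plain java.util.concurrent. It is not the Orbitz code: the class name BoundedInvoker and its constructor parameters are hypothetical, and in the setup described above they would have been supplied through Spring configuration rather than passed by hand.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration: bounds threads and queued requests, and adds a timeout.
final class BoundedInvoker {
    private final ExecutorService executor;
    private final long timeoutMillis;

    BoundedInvoker(int poolSize, int queueCapacity, long timeoutMillis) {
        // A fixed pool size bounds the threads; the bounded queue bounds waiting requests.
        // AbortPolicy rejects new work once both the pool and the queue are full.
        this.executor = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    <T> T invoke(Callable<T> call) throws Exception {
        // submit() throws RejectedExecutionException when the executor is saturated.
        Future<T> future = executor.submit(call);
        try {
            // The Future supplies the timeout that the underlying call lacks.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the thread is freed
            throw e;
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

Tuning is then a matter of changing three numbers. A caller that wraps its remote call in a Callable sees either a result, a TimeoutException, or a RejectedExecutionException when the system is saturated.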

The implementation had two primary flaws. The first was the difficulty of passing around context that had to live in a ThreadLocal, but that’s a pain-in-the-ass regardless and should be avoided if possible. The second was a slightly awkward callback mechanism for managing the timeouts. Hazelcast returns the Future directly to the caller but we chose to abstract this away so the caller coded to an interface which offered nothing about the possibility of being invoked remotely. To compensate, a callback mechanism could change any pre-configured timeout value at runtime based on the interface and/or the parameters of the invocation.
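One way to picture hiding the Future behind a plain interface, with a callback that can override the configured timeout, is a JDK dynamic proxy. This is a sketch under assumptions, not the original implementation: FareService, TimeoutPolicy and TimeoutProxy are hypothetical names, and reporting a timeout as an IllegalStateException is just one possible choice.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical service interface: nothing in it hints at remoteness or Futures.
interface FareService {
    String quote(String route);
}

// Hypothetical callback: may override the configured timeout per invocation.
// Note that args is null for methods without parameters.
interface TimeoutPolicy {
    long timeoutMillis(Method method, Object[] args, long configuredMillis);
}

final class TimeoutProxy implements InvocationHandler {
    private final Object target;
    private final ExecutorService executor;
    private final long configuredMillis;
    private final TimeoutPolicy policy;

    private TimeoutProxy(Object target, ExecutorService executor,
                         long configuredMillis, TimeoutPolicy policy) {
        this.target = target;
        this.executor = executor;
        this.configuredMillis = configuredMillis;
        this.policy = policy;
    }

    // Returns an object implementing iface whose calls run on the executor with a timeout.
    static <T> T wrap(Class<T> iface, T target, ExecutorService executor,
                      long configuredMillis, TimeoutPolicy policy) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface},
                new TimeoutProxy(target, executor, configuredMillis, policy));
        return iface.cast(proxy);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Ask the callback for the timeout to use for this particular call.
        long timeout = policy.timeoutMillis(method, args, configuredMillis);
        Callable<Object> call = () -> method.invoke(target, args);
        Future<Object> future = executor.submit(call);
        try {
            return future.get(timeout, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            // Unchecked, so it can cross an interface that declares no exceptions.
            throw new IllegalStateException(
                    "Timed out after " + timeout + " ms: " + method.getName(), e);
        } catch (ExecutionException e) {
            // Unwrap so the caller sees the exception the target actually threw.
            Throwable cause = e.getCause();
            throw (cause instanceof InvocationTargetException) ? cause.getCause() : cause;
        }
    }
}
```

The caller holds a FareService and never sees a Future. A policy such as (method, args, configured) -> configured keeps the configured value, while another could lengthen the timeout for a known slow route.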

I’ve spoken before about the trade-off of abstracting remote invocations. On one hand it ensures discovery, error handling and invocations are accomplished consistently but on the other it invites developers to ignore the realities of a distributed system which can lead to the if-it-looks-local-it-will-be-coded-as-though-it-is-local problem. Dan Creswell offers:

“I believe the best chance we have for doing distributed right is not by providing some de-facto standard toolset, rather it’s through education and mentoring to encourage the correct mindset. Such a mindset allows a developer building a distributed system to choose the most appropriate tools and use them right.”

I agree that’s definitely the best long-term solution but in the meantime the site needs to be up.

4 comments

  1. “Hazelcast returns the Future directly to the caller but we chose to abstract this away so the caller coded to an interface which offered nothing about the possibility of being invoked remotely.”

    Philosophical question: Is the returning of a Future directly an indication of remoteness or merely the fact that the processing is asynchronous and may not complete in the desired time? If it were the latter would you then feel bad about returning the Future to the developer?

    “I agree that’s definitely the best long-term solution but in the meantime the site needs to be up.”

    I hear this argument a lot and to some extent it stands up. However, there comes a point (usually related to growth in usage or some analog) where to continue to keep the site up requires the change in mindset. Where is that transition point? How do/Can you make the transition smoothly? How much will it cost you and can you afford it or will the system collapse in on itself as you battle with your limited resources to sort it all out?

    A certain amount of history suggests that actually no-one bothers with worrying about the transition and then has to suffer the associated impact of a major re-architecture, mindset change and re-structuring of the codebase. Amazon’s story of moving from Obidos to where they are now is a fine demonstration (and they borrowed a substantial amount of money to finance the change), eBay seemingly have gone through somewhat similar troubles.

    However history also shows us that there are warning signs of the oncoming troubles which could allow us to confront the issues before they really, really hurt.

    Dan Creswell, September 24, 2008
  2. Hey Dan!

    The latter. And in the abstraction detailed in this particular blog post I would feel bad. In other contexts, absolutely not.

    The existing Jini invocation semantics, as you know, are synchronous without timeout. We wanted to honor this existing contract as best we could while achieving our objectives. Had we chosen to directly return a Future, change the tens of thousands of lines of code to handle the new semantics and educate every developer on the proper usage of those newly returned Futures, Orbitz would still be rolling out the enhancements. Instead, we opted to bury it to get the changes out the door faster, more consistently and with as minimal impact as possible.

    So what happened? We achieved some of our greatest quarters of site stability *and* developers started asking questions about the use of asynchronous execution and Futures, so we did some talks on how they work and their advantages vis-à-vis naked invocations. Over the course of a couple of months the use of the java.util.concurrent package became widespread, including in contexts that had nothing to do with our original pursuit. We achieved greater uptime and education: win-win.

    We identified the warning signs (it wasn’t too difficult) and made a choice on implementation — in this case it worked.

    There is no *one* transition point. A myriad of factors offer influence: technology, development/architecture skill, business requirements to name a few. Experience offers guidance.

    If you want to talk about really, really hurting: Java serialization and dependencies.

    bzimmer, September 24, 2008
  3. “There is no *one* transition point. A myriad of factors offer influence: technology, development/architecture skill, business requirements to name a few. Experience offers guidance.”

    Mmm, dunno whether we’re talking about the same transition :)

    There are many transition points and they are indeed driven by the things you suggest and certainly experience helps. So in your case you are saying that you have expanded the number of developers who now deal with remoteness or do you still hide it from the majority?

    Perhaps you are referring to the transition from little use of util.concurrent to lots?

    In respect of my scribbling above I was suggesting that many companies ignore the issue until it gets so painful that to undo it requires a fairly major effort and substantial re-education.

    On the hurt, do you mean at the Jini client-side interface or more generally?

    Good to catch up :)

    Dan Creswell, September 24, 2008
  4. I read “transition” to mean when do you decide to “just do it” and when do you decide to stop, educate, learn, architect and go forward.

    “So in your case you are saying that you have expanded the number of developers who now deal with remoteness or do you still hide it from the majority?

    Perhaps you are referring to the transition from little use of util.concurrent to lots?”

    In the case of Jini invocations, I believe remoteness is still hidden in the programming APIs (I no longer work at Orbitz) but is more top of mind in thought and understanding.

    As you sensed, I was referring more to the transition in mindset about concurrency, remoteness and such brought to light after demonstrating advantages drawn from new ways of thinking (java.util.concurrent, programming in Erlang, …)

    “On the hurt, do you mean at the Jini client-side interface or more generally?”

    Both. I ought to expand this into a post of its own.

    “Good to catch up :)”

    Indeed!

    bzimmer, September 24, 2008
