orbitz, open source and me

2008-06-30

Some friends at work were recently interviewed by Matt Asay on the release of our monitoring software, ERMA, as open source — an unconventional move for corporate America.

We have a long and quiet relationship with open source at Orbitz. In the article Matt O’Keefe was kind enough to throw a compliment my way:

We have a history of contributing to other open-source projects. Brian Zimmer and others on the team have been very active in open-source projects.

You can read more coverage about open-sourcing ERMA here and be sure to check out the real-time visualization software we released, Graphite, as well.

Congrats Matt!

Categories : development

SmugNDrag v1.4.

2008-06-17

I’m happy to announce SmugNDrag v1.4 has been released. The primary change is the addition of the Sparkle framework to automate version updates, along with a minor change to the UI. The new release is available here. Enjoy!

Categories : development   photography

Google Seattle Conference on Scalability — 2008 Edition.

2008-06-16

Overview

I was really looking forward to this conference based on my experience last year, with the likes of Jeff Dean and Marissa Mayer presenting. When I saw the original agenda for Saturday I was excited to see they expanded the number of talks at the expense of having to make a decision about which presentation to attend, a task at which I often feel I failed.

When I arrived, late, I was surprised to see they had changed the format: rather than having two tracks for each session, the presentations were shortened so everyone could attend every talk. I’m not sure how much notice the presenters were given of this decision, because a number had presentations well exceeding the diminished time frame. As a conference presenter myself, I know that a well-rehearsed presentation can be difficult to amend on the fly.

Communicating Like Nemo

I’m not sure what I was supposed to get out of this presentation. I understand that working underwater places significant constraints on connectivity, bandwidth and other factors, but I didn’t feel like I really learned much about how these are being overcome. I did get to brush up on my PADI hand signals, though; it’s been a while since I last dove.

maidsafe

Since I arrived at the conference a bit late, I was seated towards the rear of the room for the first two talks. The presenter chose to use the whiteboard as his primary presentation medium, which, as a friend said, demonstrates he really has confidence and knows his shit, but for me it was unfortunate since I could barely hear the presentation or see the board. Since my mind was already deep in debugging objc’s forwardInvocation:, I chose to leave the room and finish my work, in which I’m happy to report success. Afterwards, I learned this talk was pretty good if you could see and hear.

Chapel

Fantastic. This was the quality and topic of talk I was looking forward to seeing. Chapel is a new programming language coming out of Cray which:

supports a multithreaded parallel programming model at a high level by supporting abstractions for data parallelism, task parallelism, and nested parallelism. It supports optimization for the locality of data and computation in the program via abstractions for data distribution and data-driven placement of subcomputations.

It supports constructs within the language to create and execute arbitrarily nested tasks via a begin keyword and join on the results of those calculations via sync. Furthermore, it can execute the same tasks in parallel by using cobegin and coforall operations without changing the underlying code. This is an improvement over the current state-of-the-art MPI programming, which forces the developer to have intimate knowledge about both the high-level logic of the application and the distributed runtime, creating difficult-to-maintain code. Chapel also supports synchronization of tasks in a similar, data-driven manner.
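To make the begin/sync and coforall ideas a bit more concrete, here’s a rough analogy in Python rather than Chapel itself. It’s only a sketch: the helper names (begin, coforall, slow_square, nested_example) are mine, a thread pool stands in for Chapel’s tasks, and none of the locality features are modeled. Python threads also won’t run CPU-bound work in parallel, so this shows the shape of the code, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Hypothetical unit of work; any independent computation would do."""
    return n * n


def begin(pool, func, *args):
    """Rough analogue of `begin`: start a task and return right away."""
    return pool.submit(func, *args)


def coforall(func, items):
    """Rough analogue of `coforall`: one task per item, then wait for all."""
    items = list(items)
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        return list(pool.map(func, items))


def nested_example():
    # Leaving the `with` block waits for every task started inside it,
    # which is loosely what a `sync` block does for its `begin` tasks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        a = begin(pool, slow_square, 3)
        b = begin(pool, coforall, slow_square, [1, 2, 3])  # nested tasks
    return a.result(), b.result()
```

The point of the analogy is that the work function never changes; only the construct wrapped around it decides whether it runs as one task or many.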

In addition to the task and data parallelism, Chapel supports the idea of locales which can be CPUs, cores or separate machines entirely. Through lower level constructs such as locale and on, the developer can specify where tasks should run and how resources are accessed and utilized.

This is pretty exciting. I like the approach of high-level, don’t-worry-about-it language features with the ability to dig deeper if necessary. Unfortunately, I’m not sure this language will ever see a line of code from the likes of me given its intended problem domain and hardware.

Carmen

The scientific community is plagued by a number of issues regarding research, such as a myriad of file formats, no central repository for data, and limited data sharing and analysis. Carmen addresses some of these concerns through the implementation of a domain-specific cloud architecture. In many ways it looks and feels like EC2+AWS, but it addresses the specific needs of the science community, such as the security model for collaboration and the cost structure, given that the cost of data storage on the commercial clouds would be extraordinarily high.

In order to carry out experiments or analysis, data and services are uploaded to the cloud and then a workflow is created to integrate, via SOAP, the binary services (WARs, executables). During the runtime of the analysis, if additional services are required (based on numerous metrics) they are automatically created and deployed. This sounds a lot like a combination of EC2 and AppEngine.

The presenter also showed a photo of an exposed human brain from an operation — unexpected at a computer conference.

GIGA+

This was one of those talks that was, for me, better for the bits of take-away material than the actual product being presented. For example, when a node reaches storage capacity in GIGA+ it splits off some of its data to another node. In order to achieve limited-to-no locking, each node keeps a table of where it sent data, so not every client has to be updated immediately; clients can be lazily updated instead. If a client makes a request to the old node because of a stale view of the world, the request is forwarded, à la HTTP, and the client is updated. I also learned about extendible hashing and bitmap management of partition locations.
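Here’s a small sketch of the split-and-forward idea as I understood it. This is not GIGA+’s actual design: the Node and Client classes, the per-key client cache (the talk described bitmaps), the capacity of 4 and the CRC32 hash are all my own assumptions for illustration. Each node remembers only its own splits, and a client with a stale view gets forwarded and then remembers the node that really served the key.

```python
import zlib


def stable_hash(key):
    """CRC32 gives the same bits on every run, unlike Python's str hash."""
    return zlib.crc32(key.encode("utf-8"))


class Node:
    def __init__(self, name, depth=0, capacity=4):
        self.name = name
        self.depth = depth        # next hash bit this node will split on
        self.capacity = capacity
        self.store = {}
        self.splits = []          # (bit, node that took keys with that bit set)

    def locate(self, key, hops=0):
        """Consult only this node's own split table, forwarding if needed."""
        for bit, child in self.splits:
            if (stable_hash(key) >> bit) & 1:
                return child.locate(key, hops + 1)
        return self, hops

    def put(self, key, value):
        node, _ = self.locate(key)
        node.store[key] = value
        if len(node.store) > node.capacity:
            node.split()

    def get(self, key):
        node, hops = self.locate(key)
        return node.store.get(key), node, hops

    def split(self):
        """Hand keys with the next hash bit set to a new node; record where they went."""
        bit = self.depth
        child = Node(f"{self.name}.{bit}", depth=bit + 1, capacity=self.capacity)
        for key in [k for k in self.store if (stable_hash(k) >> bit) & 1]:
            child.store[key] = self.store.pop(key)
        self.depth += 1
        self.splits.append((bit, child))


class Client:
    def __init__(self, root):
        self.root = root
        self.cache = {}           # key -> node that last served it (may be stale)

    def get(self, key):
        start = self.cache.get(key, self.root)
        value, node, hops = start.get(key)
        self.cache[key] = node    # lazy update: learn the real location from the reply
        return value, hops
```

A client that asks a node which has since split still gets the right answer, just with a non-zero hop count; asking again for the same key goes straight to the right node. Nothing ever pushes updates out to clients, which is where the limited locking comes from.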

This could have been a more interesting talk, but it made a lot of assumptions about the operating environment that make it more or less unrealistic at the moment, such as: the network is always reliable, the configuration is static, and there is no offline or disconnected mode.

Google Maps Mobile

A light, but interesting overview of the problems facing mobile development: lots of OSs, form factors, bandwidth, available storage, security, localization, …

Wikipedia on Erlang

This talk should have replaced Erlang with DHT in the title, for it was really about replacing a typical large-scale MySQL cluster of databases with a DHT+transactions to implement a clone of Wikipedia. As far as I could tell, Erlang was used as a pseudo-message bus, with more development in Java integrated with Erlang via JInterface. In the end, this looked like a similar implementation to SimpleDB or any of the other key-value stores.

NetWorkSpaces

NetWorkSpaces is a Python-implemented (Twisted and Zope) tuplespace integrated with R to provide parallel computation for the otherwise serial computational model of R. Given the almost commodity-like tuplespace environment, it seems the real advantage here is the integration with R and not the tuplespace itself (again see SimpleDB, …), though the presenter pointed out NWS would run on any platform which runs Python. The typical deployment is small, around 12-16 nodes, because that’s a normal installation more than a limitation of the architecture.
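If you haven’t run into tuplespaces before, the core idea fits in a few lines. The sketch below is a generic, in-process toy and not the NWS API: the TupleSpace class, its put/take methods, the None wildcard and the "stop" marker are all names I made up, and a real deployment would sit behind a network server rather than a lock.

```python
import threading


class TupleSpace:
    """Toy shared space: producers put tuples, consumers take matching ones."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def put(self, *tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, *pattern):
        """Block until a tuple matches, then remove and return it (None = wildcard)."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(pattern, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

    @staticmethod
    def _match(pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup))


def worker(space):
    # Workers know nothing about each other; they only talk to the space.
    while True:
        _, n = space.take("task", None)
        if n == "stop":
            return
        space.put("result", n, n * n)


def parallel_squares(numbers, workers=3):
    space = TupleSpace()
    threads = [threading.Thread(target=worker, args=(space,)) for _ in range(workers)]
    for t in threads:
        t.start()
    for n in numbers:
        space.put("task", n)
    results = dict(space.take("result", None, None)[1:] for _ in numbers)
    for _ in threads:
        space.put("task", "stop")   # one stop marker per worker
    for t in threads:
        t.join()
    return [results[n] for n in numbers]
```

The serial caller just drops work in and picks results out, which is presumably why this model pairs well with a language like R that has no native parallelism.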

Shared Transactional Memory

A good, general overview of the problems facing language and hardware (Azul, Sun Rock) developers and engineers as they attempt to address transactional memory. I thought the presenter did a nice job of demonstrating the issues through code examples but as with any [H|S]TM presentation, it was light on answers and heavy on “that needs to be figured out”.

Conclusion

I was glad I went for the Chapel talk and enjoyed the Carmen, GIGA+ and STM talks.

One of the themes I took away was that while cloud computing has become mainstream, there’s a need to add domain-specific abstractions on top of it, not too dissimilar really to the ever-growing popularity of DSLs implemented in mainstream languages.

I liked last year’s approach better: fewer talks, more time for each presentation, more polished speakers and more technical content; I also liked the move to Seattle from Bellevue.

pysmug, tag clouds, asynch IO and the SmugMug API.

2008-06-12

A question was asked on a dgrin thread about whether the SmugMug API supported building a tag cloud. It doesn’t. A responder suggested it would take far too long to generate one from the API since you’d have to trawl through every photo. This is indeed true, but you don’t have to do it serially. I consider the batchable interface for pysmug to be its selling point, and building a tag cloud is the perfect demonstration.

In order to get the results for my 80+ albums and 3200+ photos I need to make one call to get the full list of albums and then one call each for every photo. If this were being done serially, then I’d give up too, but under pysmug sits PycURL+libcurl, which are very fast at handling many, many simultaneous requests.

Here’s the code:

import collections

def tagcloud(self, kwfunc=None):
  """
  Compute the occurrence count for all keywords for all images in all albums.
 
  @keyword kwfunc: function taking a single string and returning a list of keywords
  @return: a tuple of (number of albums, number of images, {keyword: occurrences})
  """
  # Queue one images_get request per album on the batch.
  b = self.batch()
  albums = self.albums_get()["Albums"]
  for album in albums:
    b.images_get(AlbumID=album["id"], AlbumKey=album["Key"], Heavy=True)
 
  images = 0
  # _kwsplit is the default keyword splitter, defined elsewhere in the module.
  kwfunc = kwfunc or _kwsplit
  cloud = collections.defaultdict(int)
  # Iterating the called batch yields (params, response) pairs as responses arrive.
  for params, response in b():
    album = response["Album"]
    images += album["ImageCount"]
    # Skip images whose keyword string is empty after stripping whitespace.
    for m in (x for x in (y["Keywords"].strip() for y in album["Images"]) if x):
      for k in kwfunc(m):
        cloud[k] += 1
 
  return (len(albums), images, cloud)

The big win here is I’m not waiting on sum(response times) but rather on max(response times) because the requests are being handled asynchronously and the responses are coming back as soon as they’re ready. If I remove the use of the batchable and instead make the requests serially I wait much, much longer: batchables create the cloud in less than 30 seconds, serially it takes just under three minutes. This works out to around 110 requests/second for the batchable and 19 requests/second serially. I’d say that’s an impressive performance improvement.
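The sum-versus-max point is easy to see in isolation. The sketch below is not pysmug, which sits on PycURL; it uses asyncio with sleeps standing in for HTTP calls, and the function names and latency values are made up for the demonstration.

```python
import asyncio
import time


async def fake_request(latency):
    """Stand-in for one HTTP call; sleeps instead of touching the network."""
    await asyncio.sleep(latency)
    return latency


async def serial(latencies):
    # Each request starts only after the previous one has finished.
    return [await fake_request(latency) for latency in latencies]


async def concurrent(latencies):
    # All requests are in flight at once on a single thread.
    return await asyncio.gather(*(fake_request(latency) for latency in latencies))


def timed(coro_func, latencies):
    start = time.perf_counter()
    asyncio.run(coro_func(latencies))
    return time.perf_counter() - start


latencies = [0.05, 0.10, 0.02, 0.08]          # made-up response times, in seconds
serial_time = timed(serial, latencies)         # roughly sum(latencies)
concurrent_time = timed(concurrent, latencies) # roughly max(latencies)
```

Real servers and connection limits keep you from hitting the ideal max(response times), which is why my numbers above show about a 6x speedup rather than something closer to 3000x.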

This new method is available on tip and will be released with v0.5 (though it’s easily back-patched to v0.4). There are a number of other batchable examples in the SmugTool class.

I love asynchronous IO — concurrently handling many requests with a simple API makes me happy; using only one thread makes me happy too.

SmugNDrag v1.3

2008-06-08

Based on some user feedback I added the ability to choose the destination — gallery or lightbox — at SmugMug. Download it here. Enjoy!

Categories : development   photography

SmugNDrag: scratching an itch.

2008-06-07

I use MarsEdit to write blog posts and SmugMug to manage online images. Unfortunately, MarsEdit, which supports Flickr integration, lacks connectivity with SmugMug. Enter SmugNDrag.

SmugNDrag automates the generation of image links suitable for blogging: an embedded image which, when clicked, navigates to the full image in the SmugMug gallery. It understands SmugMug’s naming convention to render the correct image size and can optionally populate the image description.

It’s pretty straightforward to use: simply navigate to the desired image at SmugMug in your browser and copy the URL to SmugNDrag. The resulting URL is copied to the pasteboard, similar to the Share functionality at SmugMug, only with a richer link.

It’s available here. Feedback welcomed. If you use it, post your blog in the comments.

Categories : development   photography

There’s an espressohound running on Google App Engine.

2008-05-31

My first Google App Engine application, espressohound, has been deployed. It’s a totally barebones espresso tasting log with almost no interesting features, but it’s at least minimally useful.

I’m tired — I’ll have more to say after I sleep on it but given the total vendor lock-in and significantly locked-down Python interpreter I’d be surprised if I spent a lot more time on this (though I do really want the espresso tasting application).

Categories : development