pysmug, tag clouds, asynch IO and the SmugMug API.

June 12, 2008

A question was asked on a dgrin thread about whether the SmugMug API supports building a tag cloud. It doesn’t. A responder suggested it would take far too long to generate one from the API since you’d have to trawl through every photo. That is true if you make the requests serially, but you don’t have to. I consider the batchable interface for pysmug to be its selling point, and building a tag cloud is the perfect demonstration.

To get the results for my 80+ albums and 3200+ photos I need to make one call to get the full list of albums and then one call per album to fetch its images; the Heavy=True response carries each photo’s keywords. If this were done serially I’d give up too, but under pysmug sits PycURL + libcurl, which are very fast at handling many, many simultaneous requests.

Here’s the code:

def tagcloud(self, kwfunc=None):
  """
  Compute the occurrence count for all keywords for all images in all albums.

  @keyword kwfunc: function taking a single string and returning a list of keywords
  @return: a tuple of (number of albums, number of images, {keyword: occurrences})
  """
  # Add one images_get request per album to the batch.
  b = self.batch()
  albums = self.albums_get()["Albums"]
  for album in albums:
    b.images_get(AlbumID=album["id"], AlbumKey=album["Key"], Heavy=True)

  images = 0
  kwfunc = kwfunc or _kwsplit
  cloud = collections.defaultdict(int)  # needs "import collections" at module level
  # Run the batch and tally the keywords found in each album's response.
  for params, response in b():
    album = response["Album"]
    images += album["ImageCount"]
    for image in album["Images"]:
      keywords = image["Keywords"].strip()
      if not keywords:
        continue  # this image has no keywords
      for k in kwfunc(keywords):
        cloud[k] += 1

  return (len(albums), images, cloud)
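The kwfunc hook is what decides how a raw keyword string becomes individual tags. The default _kwsplit isn’t shown here, so the following is only a sketch of what a custom splitter might look like. The name split_keywords and the separator rules (commas or semicolons, falling back to whitespace) are my assumptions for illustration, not pysmug’s actual behaviour.

```python
import re

def split_keywords(raw):
    """Hypothetical kwfunc: split a raw keyword string into normalized keywords.

    Assumption: keywords are separated by commas or semicolons; if neither
    appears, fall back to splitting on whitespace.
    """
    if "," in raw or ";" in raw:
        parts = re.split(r"[;,]", raw)
    else:
        parts = raw.split()
    # Trim whitespace and stray double quotes, lowercase, drop empty entries.
    cleaned = (p.strip().strip('"').strip().lower() for p in parts)
    return [p for p in cleaned if p]

print(split_keywords("Sunset, Olympic Peninsula; hiking"))
```

A function like this would be passed in as tagcloud(kwfunc=split_keywords). Lowercasing means "Sunset" and "sunset" land in the same bucket of the cloud, which may or may not be what you want.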

The big win here is that I’m not waiting on sum(response times) but on something closer to max(response times), because the requests are handled asynchronously and the responses come back as soon as they’re ready. If I remove the batchable and make the requests serially I wait much, much longer: the batchable creates the cloud in less than 30 seconds, while the serial version takes just under three minutes. Measured against my 3200+ photos, that works out to around 110 photos/second for the batchable and 19 photos/second serially. I’d say that’s an impressive performance improvement.
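The sum-versus-max point can be seen in a toy sketch. This one uses Python’s asyncio with simulated delays rather than pysmug or PycURL, and the latencies are made-up values, not measurements from the SmugMug API. It only illustrates why overlapping the waits on a single thread pays off.

```python
import asyncio
import time

# Simulated per-request latencies in seconds (made-up values, not measurements).
LATENCIES = [0.05, 0.10, 0.02, 0.08]

async def fake_request(latency):
    """Stand-in for one API call: waits, then returns its latency."""
    await asyncio.sleep(latency)
    return latency

async def run_serially():
    """Await each request before starting the next one."""
    start = time.perf_counter()
    for latency in LATENCIES:
        await fake_request(latency)
    return time.perf_counter() - start

async def run_concurrently():
    """Start every request at once and wait for all of them to finish."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(latency) for latency in LATENCIES))
    return time.perf_counter() - start

serial_elapsed = asyncio.run(run_serially())
concurrent_elapsed = asyncio.run(run_concurrently())
print(f"serial:     {serial_elapsed:.2f}s (about sum = {sum(LATENCIES):.2f}s)")
print(f"concurrent: {concurrent_elapsed:.2f}s (about max = {max(LATENCIES):.2f}s)")
```

The serial run should take roughly the sum of the delays and the concurrent run roughly the longest single delay. Real requests also compete for bandwidth and server capacity, so the measured speedup is smaller than this idealized picture suggests.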

This new method is available on tip and will be released with v0.5 (though it’s easily back-patched to v0.4). There are a number of other batchable examples in the SmugTool class.

I love asynchronous IO – concurrently handling many requests with a simple API makes me happy; using only one thread makes me happy too.
