On the Cusp of Big Data

Alternate title: Episode IV: A New Hope

I’m restarting this blog after a long hiatus – a couple of years, at least. It looks like my old posts were purged in the meantime, but that’s probably for the best [1]. My musings about Hibernate from 2007 wouldn’t be that interesting now [2].

Where I’m coming from:

I work on a team that, in a lot of ways, is on the cusp of Big Data: we deal with gigabytes, but not terabytes, of data, and we don’t have endless racks of commodity servers. We have a homegrown task framework that follows a typical master-worker pattern and allows tasks to be distributed among nodes on different servers. That works well, and I like the framework in general – it could use some tidying up, but it’s simple and functional.

Our data, though, is all stored inside an Oracle database, and we’re knocking at the edge of its capabilities. We haven’t entirely maxed it out yet, but each performance gain has been harder to come by, and we can easily see the time approaching when it will be cheaper to rearchitect how we store and serve data than to eke out more performance through smarter partitioning, better queries, or a faster SAN.

So over the past several months I’ve been reading about some of the competitors in the big-data field, and sketching ideas for what I’d like our architecture to look like going forward. I’ve looked at MapReduce (Hadoop), HBase, Cassandra, Terracotta, and a number of other ideas – different types of products, all with the goal of scaling data beyond a single server. But unlike a lot of folks looking at these options, we have an existing product in production, based on a framework that does 80% of what we need. So I find myself on a seesaw, swinging between the newest, coolest thing I’ve read about and the pain of rewriting what we have on a non-existent timeline, when what we have works so well – at today’s data volumes.

I decided to reinstate this blog to collect what I’ve learned so far.

[1] Does anyone else have the problem of starting blogs like New Year’s resolutions and then losing track of them? I probably have three out there, tied to some forgotten username on goodness knows which host, and they’re probably saying very insightful things about Hibernate 2.0.

[2] Turns out, I did recover some old posts from a Typo blog from 2006. A couple of the most boring ones didn’t make the move over, but most of them are here, for anyone morbidly curious about what seemed interesting at the time.