Getting Started With Solr and Clojure

I’ve been doing a lot of work with Solr recently, at work and on some personal projects. I use Clojure for most personal projects these days, and the few guides out there about using Clojure with Solr (as of this writing) were fairly out-of-date, so I thought I’d offer this one.

The primary way of interacting with Solr in Java is via the SolrJ client library. You can use it directly in Clojure if you want – it’s a fairly simple library with a straightforward API – but there are no less than 4 libraries that wrap it in Clojurey syntax:

Of these, Flux, by Matt Mitchell, is the most recently updated and the only one that bundles SolrJ 4.x. Solrclj uses version 3.6.2, so it would likely work fine with 4.x without modifications, but I haven’t tried. The other two are significantly older.

If you’re interested in following along, I’ve put together a quickstart application that you might find useful. The samples in this post are all in that project. I used Flux as the Clojure wrapper of choice, but it would be simple to switch that out for Solrclj or to use SolrJ directly, if you felt so inclined.

A note on IDEs, etc.

Chances are, you already have a Clojure setup that you are quite happy with. But just in case you don’t:

  • If you’re comfortable with Emacs, go ahead and use that. There’s a lot of information about getting started with Emacs on clojure-doc.
  • If you’re not fully conversant in Emacs (and there’s no shame in that), and you’re coming from the Java world, I’ve included IntelliJ project files in the quickstart project. IntelliJ Community Edition and the Cursive Clojure plugin make for an easy way to get started. Eclipse and the Counterclockwise plugin work too, and are quite popular, but I haven’t used it in a while…if you want to contribute Eclipse project files for the quickstart, send me a pull request.

If you don’t have it already, download Leiningen, the most common Clojure build tool. It has a one-line installation script.

Setting up a sample Solr server

This sample application connects to a running Solr server. Setting up a server is simple. One simple way to do it:

Download Solr from and unzip it. Go into the examples/ directory and run:

 java -jar start.jar

That will start a sample Solr server. Keep that terminal open.


Now that there’s a Solr server running, time to fire up a REPL in the solr-clojure-sample project. If you’re using IntelliJ, you can do that using Run -> lein repl, and you’ll have the REPL there inside your IDE, and you can execute things in the editor window directly in the REPL and all that goodness that Emacs users are used to. You can also choose “debug” instead of “run”, and you’ll be able to set breakpoints, which Emacs users are not used to.

Alternately, you can just run “lein repl” from the command line. You miss out on the IDE integration goodness, but it will work fine.

Either way you start the REPL it will automatically load dev/user.clj (because project.clj says to). This loads a set of useful namespaces as well as some helpful functions for development: init, start, stop, go, and reset. For more on the philosophy behind these, see Stuart Sierra’s blog post or the related podcast. These have nothing to do with Solr specifically – they’re just a Clojure software architecture design pattern – but since the quickstart doesn’t make much sense if you don’t understand what they’re for, I’ll go into a little detail.

In short, there’s a map called blast.solr/config that contains a variety of configuration properties for the app. A production app would take these config settings and use them to build a Solr connection which it would then hold on to for the life of the application. How it holds onto that connection will vary by application… for development purposes, we just have an atom in dev/user.clj called “system”. Calling (go) from your REPL will use the config map in blast.solr/config and connect to your sample Solr server, then populate the system atom with that connection.

If for some reason you make changes that require resetting the connection, or changes to the configuration map, you can run:


This will close any existing connection, reload the namespaces, and reconnect.

Load Data

Solr isn’t very interesting without any data, so let’s load some. I’ve included a text file called “sonnets.edn” which contains all of Shakespeare’s sonnets, in EDN format. The fields are already named using some Solr conventions: *_t to denote text fields, *_i to denote ints, and so on. To load the data, make sure you’ve run (go) in the REPL to establish a connection, and then run:

 ;; We'll read a file line-by-line and load it into Solr
 (flux/with-connection (conn)
                    (with-open [rdr (io/reader "sonnets.edn")]
                      (doseq [line (line-seq rdr)]
                        (println line)
                        (flux/add (edn/read-string line)))
                      (flux/query "*:*"))) ; query our new data

Run some queries

The quickstart app has a few helper functions in user.clj that might be useful. q, for example, runs queries:

(q "*:*")

There are some commented-out forms there that will load some sample data into Solr for experimentation. For example:

 (q "line_t:desire")
 (q "line_t:desire" {:deftype "edismax"}) ; send query args if you want
 (q "line_t:desire" {:deftype :edismax})  ; flux  will convert keywords into strings, so use them at will.
 (q "{!edismax qf=\"line_t\"}desire")     ; version with local variables. Anything that's valid in a Solr query is valid here.


Refactoring and Technical Debt

There’s a lot of great ideas under the “agile development” umbrella. It now encompasses ideas pulled from older software engineering practices, manufacturing techniques, behavioral science, and goodness knows what else. But any method emphasizes some things at the expense of others, and in my experience a lot of projects have real problems with how to handle significant technical changes — whether they’re called refactoring, or redesign, or technical features — within the context of an agile practice.

I’m not going to solve that here, but I’m going to describe what I’ve seen in case others have answers.

First, for background, typical agile development methods say this about refactoring:

  1. It should be done continuously.
  2. It shouldn’t be called out as its own user story; it should either be done as part of an existing user story, or baked into the “slack” time of an iteration. As you work on any feature, you should clean up the code where you are. “Leave it better than you found it.”
  3. Meanwhile, don’t design anything too far in advance (YAGNI), and let them instead emerge over time.

I don’t disagree with that. But I think don’t think it always works quite so neatly. To break it down, let’s classify the “technical work” into three buckets1:

  1. Simple, localized refactors.
  2. Larger, system-wide refactors.
  3. Non-functional features.

For #1 (Simple, localized refactors), I think everyone agrees. Good development practice dictates that you leave the code better than you found it. If you’re working in some code that needs documentation, or has some complex logic should be broken down into easier-to-understand methods, or there’s some code duplication, clean it up. There’s no need for a separate user story or ticket; that’s just run-of-the-mill as-you-go refactoring. No arguments there, right?

But there are cases that aren’t so simple, and those are the cases I’ve seen projects have problems with.

larger, cross-feature refactors.

These are the refactors that are bigger. Perhaps you have a repeated data-access pattern that could be abstracted away, or you have three different patterns to do the same basic thing, and they need to be merged into one. This is the kind of thing that NEEDS to be done on a large codebase as time goes by2, or it becomes so unweildy that your maintenance time completely overshadows your development time, your new features take months when they used to take weeks, and ramp-up time for new developers is longer their average life expectency.

As an example, I worked on a project a few years back that had started out using Hibernate as its persistence layer. Hibernate was no longer working for us, for a number of reasons, so it was time to ditch it and find something else. To make matters worse, there were a few different attempts at non-Hibernate (or half-Hibernate) DAOs already in the codebase, but each worked differently and handled only a subset of our use cases. We needed to come up with a better solution, and ideally refactor old uses so that our data access patterns were consistent. That was going to take some time.

A lot of people would argue that this is no longer really “refactoring” as much as building a new feature—just an internal, developer-facing feature. I think you could argue it either way, but they each present some challenges for agile shops.

When they are needed, these refactors often fall into a gray area—not as huge as “we need to rewrite the application”, but not as small as “this interface needs some documentation.” This is the gray area that could, potentially, be done as part of a client-facing feature, but in doing so it will probably make that feature take drastically longer.

In this scenario, the iteration planning or grooming conversation might look like this:

Scrum Master (SM): “Ok, what tasks should we create for Story X?”

Dev: “To start with, we need to refactor Interface A, and change all users of it to use a new paradigm.”

SM: “Does that really need to get done for Story X? It seems unrelated.”

Dev: “It isn’t strictly necessary to implement the story, but that interface has turned out to be a leaky abstraction that has made all users of it more complicated as a result. We didn’t realize it would be such a problem when we wrote it; but now it just needs to be refactored. And since this story involves that same interface, now is the time.”

SM: “How long with that take?”

Dev: “Probably 20 hours, total.”

(they then proceed to break it down into smaller tasks)

…now things could go one of a few ways. In an ideal agile scenario, where timelines are entirely flexible and there is complete trust between product owners and developers, the rest of the conversation is:

SM: Ok.

But most scenarios aren’t ideal. And many (most?) timelines aren’t really entirely flexible, nor relationships as entirely trusting. So the rest of the conversation might be:

SM/Product Owner/someone with a stake in the timeline: “20 hours? Um… we don’t have time for that right now. We need to just do what’s necessary for Story X and refactor that interface when we have more time.”

Unfortunately, when that happens, developers learn fairly quickly to work around it. I’ve seen several teams where unfortunately the conversation starts playing out like this instead:

Scrum Master (SM): “Ok, what tasks should we create for Story X?” Dev: “To start with, we need to refactor Interface A, and change all users of it to use a new paradigm.” SM: “Does that really need to get done for Story X? It seems unrelated.” Dev (lying): “Yes, it absolutely needs to get done for Story X. And I estimate it will take 20 hours.”

We call this “sneaking in technical stories”, where we get things done under the disguise of sometimes tangentially related user stories. It’s “sneaking in” because, in reality, these improvements — whether refactors or new technical features — are rarely necessary to complete the features. That’s the nature of technical debt: You can pay off the collateral, or you can keep paying interest on the debt. It is always cheaper, on any given story, to pay the interest instead of paying down the whole collateral. Paying down the whole collateral, in some cases, can take a significant amount of time — often the cost of paying down the debt dwarfs the cost of the actual story you’re estimating. It starts sounding very artificial and “sneaky” when a story should take 2 days, but in order to pay down technical debt you estimate it at 21.

I know that there are teams where the relationship between the developers and the product owner is good enough, and there is enough trust built up that when a developer says, “I really think we need to clean this up, but it will take an extra week,” the product owner says, “Ok, you know the timelines, I trust you to get it done.” But in the typical Scrum team, the odds are stacked against that relationship. A typical product owner is non-technical, so they see only features and timelines and find technical debt a fairly abstract concept. So given the choice between having a feature completed in 4 hours or 20 hours, with the only difference being the amount of technical debt removed from the codebase, the product owner, seeing a very real schedule constraint, will pick the 4 hour option, 9 times out of 10. Often they’ll try to strike a very rational bargain: “Let’s do the 4 hour option now, and we’ll try to get to the 21 hour option if we have extra time at the end of the release.” Then a separate technical story is created (“clean up interface A”) and it falls into the backlog, never to rise to the top. Knowing that this is going to happen, devs learn to lie and say it’s “absolutely necessary” in order to get it into the sprint. Or worse, they might just pad estimates with the debt in mind, and not add a task for it at all.

You could argue that saying “yes, this is absolutely required to complete Story X” isn’t that bad. The development team is responsible for the quality of the code, so if they feel it improves the quality of the code, it’s required. And perhaps it is a lesser evil, but I think it is problematic. One particularly nasty side-effect that I’ve seen is that, since developers feel like they’re doing the work “under the radar”, they end up doing it as quickly as possible — perhaps including shortcuts, or “I’ll have to deal with this part of the problem later”. They tend to want to understate the time they’re putting into it in standups. They might not document the new interface fully, or advertise it broadly, since it is being done undercover. It leads to sloppy work.

None of this is to villify product owners: the product owner doesn’t know the difference between a “nice to have” and “must have”, and devs can’t quantify the cost of paying the interest on techical debt vs. paying down the debt itself. So in my experience product owners will often entertain a few of these instances early in a release cycle, but then once a couple stories get ballooned up from 4 hours to 20 hours because of technical work, they loose their appetite. And to their credit, developers can always find more things that need to be refactored and cleaned up. So midway through every release cycle, the mandate comes down: “We just need to get the features done, we are on a tight timeline, no more technical stories.” No matter how closely I’ve worked with product owners, or how good their relationship with development is, it seems to end up in a similar place.

Larger refactors are even worse. There’s a class of technical work that is too large to conceivably fit within a functional story. Perhaps you foresee performance issues and you need to change your persistence strategy. That’s going to take R&D, and a fair amount of rework, and you’d like to do that before it’s a critical burning issue.

Or perhaps you’ve put off doing those medium-sized refactorings for long enough that you’ve got a pile of leaky abstractions or similar-but-not-quite-identical patterns that need to be unified, and it’s no longer a convenient sprint-sized task.

Now it has to be called out explicitly, and that relationship between the dev team and the product owners is tested further. I’ve seen very few nontechnical product owners who will prioritize this type of work into a sprint.

Then there’s non-functional features. These are standalone features, like a console for maintenance, which are not refactoring or technical debt, but are features with benefits to development or operations — someone other than the canonical business user that the product owner represents. Perhaps the tools will speed debugging, or make operations easier, or whatever, but they’re not specifically called out in a client SOW.

This has the same problem: with one team, and a nontechnical product owner, it’s hard to figure out how to put these into sprints. Because at the end of the day, “business value” is usually defined as “value delivered to clients,” where “clients” is always defined as “the set of business users that the product owner is most familiar with”. So you stack your releases with as many high-value user stories as you can — always a subset of what the client wants.

And again, I don’t say that to vilify product owners at all. It puts them in a tough spot. Imagine you’re a product owner: You have a story in front of you for a proof of concept for restructuring the data model of the app. It will take a week to prove the concept, and after that an indeterminate amount of time to implement for real, but it may give you a significant benefit in speed of development or speed of onboarding or performance some time later. You also have 20 client-facing features that you’d like to get into this next release. A product owner will rarely, in good conscience, put that technical story over the user stories that need to get into this release. If you’re in the unenviable task of fitting an agile process into a SOW-driven sales process, then even less so. So that proof-of-concept story will never get into a sprint. They’re not strictly necessary for a release, clients X and Y need features A through G in this release, so we’ll get to those “in the next release.”

There’s just no way for a product owner to prioritize possible future benefit over certain immediate benefit, even if the possible future benefit is significantly higher. All the cards are stacked against the technical stories: they have no deadline, they are not in an SOW, and they are intangible and uncertain. It’s not product owners’ fault — that’s just the way the system is structured. If they don’t get features A…G into a release, the client is unhappy and it’s their fault. But if they don’t get technical stories Q..Z into the sprint…well, we’ve lost a possible future benefit, but perhaps we can get to it in the next release. Since technical stories don’t have a deadline, we can always push them into the next release. Features, on the other hand, often have a deadline — in practice, there’s often agreements with clients, or roadmaps (however rough) posted in advance, or some other way to let users and potential users know what is coming. Given those two choices, I know which one I would pick, too. ALL the pressure is on getting the features done by the release date, for THIS release…not making future releases faster or easier.

Of course we’d hope that, somehow, the possible value of a technical story would outweigh the certain value of a feature story, eventually, if the possible value of the technical story was high enough. But I don’t think the system is set up to make that happen. I’ve seen high-value technical stories that have remained in various backlogs for years on end.

So while I understand that in an ideal world, you either express the value of large refactors or technical features and let the product owners prioritize them, or else let the technical team decide that it’s time to take a couple sprints off to tackle a large refactor while the product owner looks on obligingly. But that seems to be the exception rather than the rule.

There are alternatives, though. Some companies use 20% time as an alternate way of scheduling in this kind of work. Some use a flat ratio of technical to functional work. Some use a dedicated team that handles “framework” or “tools”-related work to remove at least that portion of the technical backlog.

If you’re in a situation where you hear things like, “in the next release, we’ll bite off a more reasonable timeline and prioritize these things in,” look a little closer: you might have this kind of systemic problem where technical work really can’t get prioritized in. Perhaps it’s time to look for other options.

  1. I’m using the term “technical” here to contrast it with “functional,” meaning something that the user will explicitly ask for.

  2. Perhaps this doesn’t happen on all projects. It could be that there are projects out there where these refactors are always be caught early enough that they’re still fairly small, and so are always fixed as part of run-of-the-mill ongoing refactoring. I’m not sure if that’s true or not, but I am sure that if you ever walk into an app that is large enough (say, a hundred thousand lines of code and up), there will be cases like this that need to be fixed. I suspect that as an application reaches a size where a single developer is no longer intimately familiar with the entire thing, you’re going to have duplication, repetition, and leaky abstractions. I’d love to think that the next app that I grow from the ground up will be diligently refactored intelligently enoguh and often enough that there will be no latent patterns that need extracting or larger refactorings or reorganizations that need to be done… but at the same time I won’t put money on it.

On to Octopress

I’ve just switched the blog over to Octopress. I’ve been using vanilla Jekyll for a while to keep up this blog, and I put a bit of effort into configuring it, building up a set of Bootstrap-based templates and so on. So, why switch?

At the end of the day, this is a utilitarian site. I’m not going to impress anyone with my design chops – I enjoy graphic design in theory, and I’d like to think that I’m not the absolute worst at it, but I do little enough HTML these days that it takes a long time to get things done, for end results that are never what I want them to be. And I’d rather put the time into writing actual content1. Meanwhile, my site wasn’t displaying terribly well on phones, and Octopress handles mobile devices well out of the box. And it’s beautiful to boot2.

So, I’m impressed so far. Thanks Brandon!

  1. Not that, in practice, I’ve been writing actual content.

  2. What a weird phrase.

In Defense of Easy

the setup

I watched Rich Hickey’s talk from Strange Loop called “Simple Made Easy”, which has been making the rounds recently1. He’s a great speaker, and always makes interesting points. But something in that talk didn’t sit right with me2. While his slide about Development Speed (“emphasizing ease gives early speed…ignoring complexity will slow you down over the long haul”) really struck a chord3, it felt like he was shortchanging “easy” in favor of “simple”4. This, then, is my long, somewhat circuitous exposition on that thought.

the case study

I’m working on software that I’ve been involved with for several years. Like a lot of projects, its evolved as the company’s (and their clients’) needs have changed, and it’s been written under schedule pressures and ever-shifting requirements – in other words, it’s like most software, as far as I can tell. I dream some days of VC-funded startups in their Herman Miller chairs building beautiful software from scratch with plenty of cash and plenty of time, but I suspect the grass isn’t really so green over there.

The application has gotten to the age where the layers of rapid development are starting to take their toll: new developers coming onto the project complain that the software is just too complex. By ‘complex’, they mean, “it’s hard to wrap my brain around all of the moving parts.” Some of the symptoms of this version of ‘complexity’ include:

  • it takes a very long time to get new developers up to speed
  • there are a fair amount of bugs caused by changes made to one part of the application which had downstream impacts on other parts of the application. Since most developers don’t have the “big picture” in mind, they don’t know they’re breaking things when they make changes.
  • writing new features takes a long time, since the developer needs to know a lot, juggle a lot, and dodge all kinds of pitfalls in order to put something new in place.

When I interviewed developers about where the problems were, I got answers like this:

  • You have to keep too many things in mind at any one time.
  • what different parts of the application are doing, upstream and downstream of the piece you’re looking at right now – how all the pieces fit together.
  • details of the domain: you have to have some familiarity of what the end goal is in order to understand what things are doing.
  • terminology: there’s a lot of terminology that is domain-specific, but it also isn’t entirely consistent throughout the application.
  • the pattern at work in the part of the application you’re looking at: different parts of the application follow different general patterns to do what they do.
  • the version of the code you’re looking at: since there are several clients using different versions of the application, some several years old, even if you’ve gotten your head around a section of code in one codebase, there is no guarantee that the accomplishment will translate to another client.

But, is simplicity the root issue? How do we fix it? And how would we prevent it in the first place?

theoretical aside

Here’s my limited understanding of educational psychology:
We’re built to only handle a small number of things in working memory (7 things, on average – like a phone number5). To understand more complicated things, we have to either “swap out” to longer-term memory or synthesize several related things into one well-understood thing.

Swapping out takes a long time, and is tiring…how many of us have been reading through technical documentation and gotten to a point where we said, “Oh, wait… the author explained what that term meant, or what that component did…what was it again?” That’s the effect of running out of working memory and having to look in long-term memory to find something you recently read. But more often than not, that factoid probably hadn’t been committed to long-term memory yet, so we had to go back and re-read part of the document. Synthesizing is the process whereby you understand a group of related things as a whole, so that you’re able to think of the whole as one unit, rather than the component parts. This takes time: you have to “wear grooves” in your brain in an identifiable pattern. For example, when we see a face, we don’t see individual facial features – ears, a nose, a mouth, and eyes – and then try to juggle all of these features to understand whose face it is. We recognize a face as an identifiable whole. Through many repeated exposures, our brains have synthesized that pattern as “a face”, and deal with it as a single item in our memory. That’s also why elementary school teachers drill math facts – if you have internalized that 8 x 7 = 56, you don’t need to store 8 and 7 in working memory and perform the arithmetic…“8x7=56” is a discrete whole in your brain.

For folks learning a new piece of software, especially a large software system, there’s a lot of individual pieces that have not yet been synthesized into understandable, well-understood wholes. And for many large software systems – mine included – getting to that point of familiarity with is more difficult than it needs to be.


Now that I’ve butchered psychology and complexity science, I’ll move on to butcher Rich Hickey’s talk. In case you haven’t watched it, here’s the rough paraphrase of the definitions that he bases his thesis around:

  • “complex” means combining two or more things (“complecting”).
  • “simple” means “concerned with only one thing”.
  • “easy” means “lie near” or “close at hand”. Relative to the speaker.
  • Conversely, “hard” means “(conceptually) distant”.

I think it’s fair to say that Rich felt simplicity is the important metric, while “ease” is superficial.

Meanwhile, it’s worth noting that Merriam-Webster’s definition of “complex” is, in part:

1: a whole made up of complicated or interrelated parts 2c : a group of obviously related units of which the degree and nature of the relationship is imperfectly known

(some unrelated definitions omitted)

complex or hard?

Here’s where things get dicey. It seems clear that making things simpler, in Rich Hickey’s sense, would be a real benefit. In this particular application, there are a lot of places that are combining “what to do” and “how to do it”. There are optimizations that have led to I/O-related code mixed into business logic, for example. And the level of abstraction is generally too high; our business logic has to concern itself with a lot of plumbing around “when” and “where”, instead of focusing on “what”. If each component focused only on the “what” of one portion of the business logic, it would certainly be easier to understand.

But it also seems clear that many of the developers’ complaints – “we can’t keep it all in our heads at once”, “we don’t understand how the pieces fit together”, and so on – aren’t really about “complexity” in Rich Hickey’s sense. They are in Merriam-Webster’s sense: “a group of obviously related units of which the degree and nature of the relationship is *imperfectly known*”. But by Rich Hickey’s definition, being hard to conceptualize is “hard”, not “complex”.

Perhaps there’s a broader way that the “single purpose” definition of simplicity could be applied to the application as a whole that I haven’t thought of. But at the micro level, lots of very distinct pieces with a single concern don’t necessarily make it easier to understand the big picture. In some cases, his suggestions may make that worse: a rules engine, for example.

Having important information (“should we follow path X or Y?”) abstracted out to a rules engine is, in Hickey’s terms, hard: it makes important parts of the application no longer near the code itself. So the code determines what, the rules engine determines why (or is it when? not sure) by, i assume, composing the “whats” into a control flow. And then when you’re trying to debug something, you have to go back and forth between the two. That’s simple, in that the two are separated. But the conceptual “surfaec area”, if you will, is much larger. You have to hold a lot in your head to understand the behavior of the application.

Perhaps generally, simplicity in this “single-purpose” sense help greatly when understanding the individual component, but make understanding the application as a whole harder (using, again, Rich Hickey’s definition of hard). But easy and hard don’t matter, right? They’re relative to the particular developer, they’re a matter of convenience…right?

Perhaps I am reading more into that dichotomy than I should. Casting things in black and white is a perfectly reasonable rhetorical device – especially in a fairly short talk – and that’s what Rich Hickey has done here. Exploring all of the subtleties distracts from the thrust of the talk, and simply doesn’t fit into that format. But I just want to explore what those subtleties are. Because in the case we’re looking at here, “easy” matters if it’s making things hard enough to understand that you have to work on the application for years before understanding enough to add new features. And they certainly matter when there’s only a couple developers who can get paged at 3 AM when things go wrong.

There are real advantages of “easy”, where “easy” is defined as “near to our understanding/skill set”. Less conceptual weight means more working memory can be applied to the problem at hand. Less conceptual weight means less effort expended on the tool itself, more effort to expend on the problem at hand. That is inherently pleasant for engineers – it is very frustrating to put a lot of effort into the tool instead of the problem. On the flip side, once a tool is mastered, it is very, very pleasant to be able to whip through a problem efficiently: see Vi vs. Notepad.

The argument is similar to the argument that Scala proponents make6: eventually, the concepts that are currently very conceptually weighty (so many things to keep in mind!) will eventually be internalized and synthesized, and then it is better, because you can attack your problems better. The question is, is the frustration of the synthesis time (the “learning curve”) worth it?

So perhaps my argument is one of degree. When understanding the application becomes hard enough that the ramp-up time for new developers becomes untenable for the developers themselves – not just the managers that want an easily-interchangable workforce – then steps need to be taken to make things easier.

fix it! fix it!

So, if hard matters, what can we do about it?

In this particular case study, there are several things that would help, and everyone reading this far has thought of several more. In fact, I’d be willing to bet that everyone reading said, “Well, clearly, what you should be doing is —-“, where each reader will fill in that blank a different way. That’s because the world of Software Engineering has come up with all kinds of ways to make large systems easier to deal with. Here’s a few:

visualizing the abstractions

Visualizing the various related portions together saves you from having to hold parts in memory as you look at other parts – it’s all available right in front of you. Seeing how things work together also helps you synthesize them into a whole. To get a clear picture by looking at the code, you have to have internalized what each of the components do and how they connect. And since there’s far more than 7 components in those networks, they don’t all fit in working memory. So until you’ve internalized and synthesized the components and how they interact, it just feels overwhelming.

Visualizing the flow of data through the system is a similar problem. We have sequences of jobs that read various values from the database and then insert or update other values. There’s no way to visualize at a high level what components set data and what components are reading that data.

When reading through the code, I’m essentially trying to construct a mental picture of how it’s all fitting together. Add in any abstraction layer, like a dependency injection framework, and things get even harder: “what components do” and “how the components fit together” are defined in two different places (Java files and Spring context xml files, for example, for the Spring framework). If I could somehow visualize it all on one screen, it would save me the effort of constructing a mental picture… ideally I would be able to see both what components do (popup or mouseover descriptions?) and how they fit together. That would be a lot less to juggle in working memory.


In cases where the pieces of the application need to fit together but developers aren’t likely to hold all of those pieces in memory, defined contracts are an age-old solution. The theory is, you don’t have to hold it all in memory; just the compoent you’re working on, up to and including the contracts it has with neighboring components. How they fulfill their contracts is outside your concern.

Whatever form those contracts take, you need to be able to explicitly specify (in an automatically-verified way) what each component expects from others.

In the software in question, there are contracts in place between components…why those aren’t making things simpler for developers is a subject for another blog post.


If everything followed the same pattern, there you would only have to learn that pattern once. Then hopefully, like a face, every new job you saw would just look like a different but recognizable instance of the same pattern.

Consistency also helps in tooling – if everything fits the same pattern, you could create visualizations that worked for every use.

Of course, you can only be as consistent as your actual patterns of use. In this case study, there’s not yet a single pattern that meets the needs of all parts of the system. Some are iterating over data elements and performing relatively simple calculations on each, while others are constructing huge in-memory graphs of time-phased data and using those for some rather complicated calculations. With some more thought, though…

Another form of consistency is a universal vocabulary: if everything used the same term for the same concept, you’d only have to learn the definition once. Terminology was specifically called out as a barrier to understanding. “Domain-Driven Design” practitioners stress a consistent “ubiquitous language” for this reason. Honestly, I’m not sure how big of an issue this ends up being; it is a huge early barrier for new hires, but one that fades pretty quickly.

comprehensive integration test coverage

This doesn’t really help in understanding the system, but does give developers some level of confidence that, while they don’t understand the whole system, they’ll know if their change has broken it. Depending on the tests, they can potentially serve as a form of documentation. And of course it has other benefits as well – like, having components that are tested.

And of course, the classic


As a last resort, documenting preconditions and postconditions, defining terms, explaining patterns and paradigms in static documentation can help.

Documentation is a distant last place. At best it can help you internalize some piece of the application, but in practice it will just presenting data that you then have to hold in working memory while you try to reason about the system. You as the reader then have to go through the effort of reading and reviewing it in order to try to commit it into long-term memory. The more complicated the component or concept is, the less useful documentation is.

Documentation is also out-of-date as soon as it is written, it’s never maintained, has to be discoverable for the person trying to learn about things, and is high-effort for both the writer and reader. I secretly suspect that, if we could track the number of times that any given page on our internal wiki has been really read, the average across all pages would be < 1. We write more documentation than we read.

There are many, many more. Feel free to tell me your “obviously, you should be doing —!” in the comments.

finally, a conclusion

I think Rich Hickey’s division between “simple” and “easy” is useful and instructive…but I think that many of the things that we care about in designing simpler software that falls into what he would call “easy”, rather than “simple”. And if you have a nontrivial number of components, with interactions that will not easily fit within a developer’s working memory until they’ve done a significant amount of synthesis… don’t discount easy. It can save you from that 3AM phone call.

  1. I started writing this in late 2011, when it was in fact “recently”. It’s now old news, and firmly in the canon of tech talks to watch. Since then, he’s given another wildly popular talk at RailsConf called “Simplicity Matters”. I won’t be referencing it here, but it’s worth your time as well.

  2. I’m well aware that, when I disagree with Rich Hickey, the chances of me being right aren’t very good.

  3. Yes, a chord. Which, being two or more interwoven notes, is complex. But that’s ok.

  4. Perhaps someday, I’ll distill it down into a shorter, clearer message. Maybe even a couple dozen powerpoint slides. And maybe then, someone else will point out that the broad brush I used for brevity covered over some other subtlety that they found very important…

  5. The famous “Magical Number 7, Plus or Minus Two”

  6. I like Scala, for the record.

Takes Too Long to Build Your Project? Try Harder.

So, we have a problem at work. Well, more than one, perhaps…but one of them is that builds take too long. I’ve talked to folks I know in different places, and this is a pretty common complaint. But of course, we have it tough. We have a large codebase and a lot of tests. The compile time for the codebase is one issue; I’ll get to that in another blog post. But the test time is worse.

Most of our tests are proper unit tests: they test a single unit of code, mock out all components other than the one under test, and run quite quickly. But we have a significant minority of tests that go against a database, and the development databases live out on EC2 instances in the Amazon clouds. The time to run all pure unit tests across the project is pretty significant, but the time to run all the unit tests on a developer machine is untenable. Running tests for one module (out of 10 or so) takes half an hour when the code is on a developer’s machine and the development database is on an EC2 instance.

Whoa, hold on there – I can hear some of you now: “Tests that go against the database!? Those aren’t unit tests! Unit tests CAN’T TOUCH ANYTHING! AT ALL!” Ok, fine. They’re not unit tests. Call them integration tests. Call them whatever you want. The fact remains, there is logic contained in the DAO layer, and in the queries themselves (some of the SQL queries are substantially complicated). Somehow or other, that logic needs to be tested.

So, there are a few options I know of:

Mock out the database itself

This would have to be a complete enough fake-database that we could test the queries themselves and catch unbound bind variables, bad column names, and things like that, while retaining full Oracle compatibility. I haven’t seen anything that actually does this, but if you know of it, let me know. In-memory databases like H2 are often used as a substitute for the database while testing, but they don’t work for us because we use a lot of Oracle-specific SQL extensions (partition clauses, SQL hints, and so on). Even Oracle’s own TimesTen database doesn’t support these, unfortunately. If they did work, they’d be a nice option, though.

Cut and run: Mock out as much DB interaction as possible, and put the tests that do hit the DB in a separate group that run only occasionally.

We tried this first. I created a Maven profile, so that adding -P quick to the Maven command line skips all the database tests. I could have done it the other way around, so that you had to add a profile to run the database tests, but I was afraid we’d never run them. I had nightmares of folks setting up CI jobs for a new branch and forgetting to add the magic “run the database tests, too” parameter, and not seeing database-related errors until it got out onto a server. Yes, we have a QA process after unit tests, but if a bug makes it to that layer, it’s made it too far; it’s holding up other testing. I may still move the tests that hit the database into the integration-test phase, but that feels like a hack; they’re not really integration tests.

There’s also something that makes me uncomfortable when there is a disincentive to writing tests. In an ideal world, we’d be able to say, “Write all the tests you want. Load all the test data you need. We’ll run them.” So the “quick” profile was OK, but I knew we could do better. And it just happened that, elsewhere in the company, some folks were working on another project:

Move the data to the computation.

The team that maintains the development databases created a VirtualBox image of a database server, so that developers could run it locally. No more network lag, and much faster tests (about 2x faster, in early benchmarks). But there were some problems: developer laptops are only so fast, and they have only so memory. I’d love to say that each devleper could have a separate desktop workstation to use as a development database, but unfortunately the budget isn’t there. And while we have 8gb of RAM on our Macbooks, after running a local version of our app, some development tools, and Firefox, we’re often hitting swap space. So… the dev VM is less than ideal. If I could get 16gb of RAM on my machine, then I’d be in business. But the ops team balked at that. Something about not being officially supported by Apple. So that leads us to…

Move the computation to the data

Hadoop does it, why can’t I? I think the first few options have a lot going for them, and are probably enough for a lot of environments. But if software gets big enough, there’s always going to be a point where the tests take too long to run locally. And as fast as your development machine is, you can always get a cluster of build servers that are faster. And that’s probably always going to be more economical than giving each developer a beefy PC. So it’s not hard to imagine eventually wanting to do your test builds out on the CI machine.

We use Hudson…er… Jenkins as a CI server already; it builds the proejct after each checkin. So all we needed was a way to make it build the project BEFORE a checkin – what a lot of people call a “try” server (as in, “try it before you commit it”).

I did a bit of searching, and a lot of smart people have solved this problem already: Etsy, for example, uses Jenkins as a “try” server. Their solution is clean and simple: just send the diff of your code against master, and Jenkins will apply the diff, compile, and build. Nice. Most other solutions seem to take similar approaches: Mozilla, buildbot (used by the Chromium project, among others)… and so on.

Unfortunately, we have a lot of modules and a lot of branches. The branches are unavoidable – we support a number of clients. The number of modules…well, that my fault. I split the software into a lot of modules several years ago to try to encourage code reuse between projects, but the cost has been much higher than the benefit. Unfortunately, it’s now costly to merge things back together – but that’s another post for another time.

So, rather than setting up one “try” build per branch per module, doubling the number of CI jobs, I opted to do things differently. I created a “” script that runs the “compile” and “package” portions of the build, skipping tests, and then sends the compiled classes and test classes over to Jenkins. The Jenkins job just unpackages the classes and runs the enclosed tests.

The Jenkins job, in this scenario, is entirely generic – it knows nothing about the code its testing. It doesn’t connect to source control at all. It doesn’t even see the source – we’re sending it compiled class files, and it’s just running the tests.

Is this the right solution for you? Perhaps, if your constraints are similar to ours. Read the Etsy post linked above, and see if that solution seems simpler: it may well be. If you want to set up something like we did, read on.

I’ll say upfront that this method has some disadvantages. The main disadvantage is that it’s slower. The traditional try server does a diff of your local code vs. the mainline, then sends that diff up to the server and applies it, then runs the build and tests up on the server. So all that goes over the wire is your diff, and all your machine has to do is diff the code – all building is done on the CI server.

My approach, instead, does the building on your local machine, and sends the artifacts to the CI server. This is a compromise: your machine may or may not build faster than the CI server (probably not) but the artifacts are likely a larger upload. So, if in doubt, try it out and see if it works for you.

Setting it up is easy. Since we use Maven, we just have to make sure each of our Maven builds creates a test-jar:


That snippet is in a parent pom, so that’s taken care of for all of our projects.

The script looks like this, currently. It could be a lot prettier, but this was the relatively quick-n’-dirty experiment:

    # The idea behind this script is to build and package up your project locally, 
    # then send the code (the classes and the test classes) to the Jenkins server
    # to be executed there. 
    # This relies on the build being configured to generate a "test-jar" artifact
    # It has some limitations: 
    #   * You're not going to have exactly the same classpath (yet)
    #   * It can't handle multi-module projects (yet)
    #   * You can't add Maven profiles to the build (yet)
    set -u   # fail on unitialized variable
    set -e   # fail if any command exits with a non-zero status
    # Find our local directory, so we can locate the script later
    DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
    # This is set in Jenkins' configuration; you probably have to configure Jenkins to have a static port. By default it changes every restart.
    # clear out old builds, so we don't upload the wrong thing
    rm -f target/try-$USER-*
    mvn -Pdeva -DskipTests=true $@ package  
    _date=`date -u +%F.%H%M%S%s`
    # This would be the simple version: 
    # tar czvf $_filepath pom.xml target/*.jar  
    # ...but that only handles single-module projects.  This handles multi-module projects: 
    find . -name "pom.xml" -or -name "*.jar" | xargs tar -czvf $_filepath
    echo "created try package at $_filepath"
    echo "now uploading to Jenkins..."

    #Backgrounded; we don't want to wait.  The script will notify on completion using Growl.

The “” script just looks like this:


    set -e 
    set -u 
    ssh $HOST -p $SSH_PORT build try-build -p filename=/tmp/$FILENAME -p username=$USERNAME -s  
    # This is also an option, but jenkins-cli doesn't handle an encrypted private key, and who leaves their private key unencrypted?
    #java -jar jenkins-cli.jar -i ~/.ssh/id_dsa -s $1/job/try-build  build try-build -p attempt.tar.gz=$2 -p username=$3 -s  
    # This would be an alternative to using private keys for jenkins-cli.jar, but you'd need to enter a username and password 
    # --username `git config` --password (something entered at the command line)
    echo "Job has been started on Jenkins. See http://$HOST:$HTTP_PORT/job/try-build for status."
    # Did the build succeed?
    if [ $? -eq 0 ] 
      growlnotify "Try completed successfully"
      growlnotify "Try failed!"

The Jenkins job is a basic build with two parameters, one called “username” and one called “filename”. The build runs this script as its first step:

    echo "Building try number ${BUILD_NUMBER} for user ${username}"
    # This shouldn't be necessary -- Jenkins should keep each build separate -- but it seems that it wasn't the case. So we'll keep each build in a subdirectory by build number, which is easy enough. 
    mkdir $_dirname
    tar  xzvf ${filename} -C ${_dirname}
    cd ${_dirname}
    # unzip each of our -tests.jar files into a test-classes directory
    find . -name "*-tests.jar" -execdir unzip -o {} -d test-classes \;
    # unzip the normal jar files into a classes directory
    find . -name "*SNAPSHOT.jar" -execdir unzip -o {} -d classes \;

It then invokes a top-level Maven target: surefire:test -fae. This runs the actual Maven tests, but without any build step (since we’re invoking the surefire:test plugin explicitly and not the “test” phase. -fae tells it to fail at the end instead of as soon as it sees a test failure; it’s entirely optional.

There’s a lot of potential improvement. For example, right now it’s hard to tell just by looking at the Jenkins build queue who’s try is building. We could fix that within the build by using the description setter plugin. I haven’t tried that yet.

Try it out, let me know what you think. Or, if you have a better idea of how to get around long database-bound tests, I’m all ears.

Greasemonkey Is Underrated

I’ve heard about Greasemonkey a few times over the past few years. I remember reading a few people wax eloquent about it, mainly how great it is for changing the layout of web pages (like a more convenient Youtube page, or even changing Google News back to its old layout). And I remember the big security kerflaffle a number of years ago (wow, 2005 – has it been that long?). But I’d never had occasion to use it.

But when I started browsing real estate listings in earnest, I got my excuse. I like Zillow and Trulia, but the realtor I am working with has a website of his own that he prefers – clients can save listings as “favorites” and “possibilities”, make notes, and he can make notes or suggestions in return…which makes good sense. But the site is no Zillow or Trulia: it displays far less data, it’s a bit slow, and it’s fixed at 10 results per page. And when there’s over 100 listing my price range to sort through, every little bit helps.

Greasemonkey can’t help the site speed, but I’ve made some inroads on how much data it displays. I hacked up a little Greaseonkey script that runs in Firefox everytime I load the realtor’s search page. The script just goes through each listing and fires off a query to the Zillow API, returning Zillow’s value estimate, the historical min and max value of the house, and a direct link to the listing’s page on Zillow. I may add a couple other things, like tax value and last sold date (is it a flip?). I was excited about adding some neighborhood statistics, too, but unfortunately Zillow doesn’t have neighborhood data for my area.

Up next: look into the Trulia’s API (I haven’t checked if they have neighborhood information) and maybe Walkscore, Panoramio and Flickr (for neighborhood pictures). After that, some Google Map hacks…but that’s a bit more involved.

If anyone else is considering pimping their webpages – I highly recommend it – I recommend Mark Pilgrim’s Greasemonkey Hacks. I heard some disparaging comments about the docs for Greasemonkey, but that page is great. Several of the other docs on Greasespot’s wiki are worth reading as well. Those of us that are only passable Javascript programmers can be confused by Greasemonkey’s weird sandbox environment, so the documentation is necessary (objects in Greasemonkey are all wrapped with XPCNativeWrappers for security, in response to the security issues from 2005…). I don’t know if it’s mentioned in the documentation, but the newest Greasemonkey version has a “create a new script” menu action that makes getting started dead easy.

As an aside, I have to say that after taking the Stanford’s online Machine Learning class this past fall, I was excited about doing some data science to help in the real estate search (“how does the addition of light rail affect nearby housing prices?”). Predicting home prices was actually one of the driving examples through the class. But I’ve been disappointed in how difficult it is to be to get access to large sets of real estate data in the US. I guess I naïvely thought that since tax valuations and home sales are public record, there’d be a way to access a sizeable dataset. If it’s out there, I haven’t found it. And the listings themselves are almost all closed, living inside each area’s version of the MLS. The closest thing I’ve found are the API’s offered by Zillow and Trulia, but at least Zillow’s terms of service prevent you from storing their service responses locally, and they don’t return MLS results (i.e. most results added by real estate agents) in their search results via the API…you have to ask for the properties individually.

So I’ll have to play around with big data in other ways…

Cluster Management and Task Distribution, Part 3


This is a followup to my Zookeeper vs. Jgroups post, and the Hazelcast addendum.

I ended up needed caching before I needed cluster management, so after evaluating a few options I implemented a caching facade that could switch between Hazelcast and Infinispan (which is based on Jgroups). I’ve been really impressed with Hazelcast so far – no major problems. We did hit a bug with 1.9.2, but found that a patch release with a fix was already out.

It also has a convenient cluster service built in, so for our cluster management needs, it’s likely to win for us, provided we don’t run across any more problems in testing. The interfaces are easy and intuitive, and I don’t really have anything to complain about – except that there aren’t many heavy hitters that are openly using it.


For our particular set of requirements, Hazelcast appears to be the winner:

  • it provides several things we need (distributed data structures, cache, cluster management)
  • it’s easy to use
  • it’s easy to hide behind a set of facades so that we can switch it out if necessary
  • it’s very simple to configure for different client scenarios (it works on EC2, for example)
  • it’s worked well in testing
But if our requirements were different, I think Zookeeper would have been my choice. Specifically:
  • If I were coordinating more than a handful of servers – say, 50 or more – I’d go with Zookeeper. It was specifically made for coordinating large numbers of servers reliably. And it’s now the de-facto standard for that purpose: Facebook, Twitter, LinkedIn, and a lot of others are all using it for node management across large numbers of nodes. At that scale I don’t want to roll my own if I don’t have to.
  • If I owned installation and maintenance. The deployment complications of Zookeeper are one of the main downsides for our purpose; it’s another server to install and manage, which raises the operational footprint of our app. If we were running the app only within our own datacenter, no big deal – there are Puppet or Chef recipes to install it, and even pre-built EC2 images. That’s a very manageable hurdle. But we’re handing the app over to clients to install and manage, and Zookeeper adds more hardware requirements and an installation process that is good for at least a few more pages in the install guide. Hazelcast, on the other hand, adds just one line to a property file (cluster.nodes=x,y,z). Be kind to your ops teams.
There are probably also cases where JGroups is the right answer. It’s main advantage is its configurability – you can put together a nearly infinite combination of protocols stacks to suit various purposes. Check out the comments of my first post, for example, for sample code to handle node discovery in a non-multicast (e.g. EC2) environment. But if all you need is cluster management and some pretty typical services on top of that, there are simpler options. I’m going to be honest – after looking into Zookeeper closely, I’m looking for excuses to use it. But I don’t think that will happen right away.

Cluster Management and Task Distribution Part 2: Caching

The question

Which technology stack makes more sense for a distributed caching framework?

The backstory

I’ve gotten a fair amount of hits on this post comparing JGroups and Zookeeper for task distribution and cluster management. I wanted to write an update of what I’ve investigated so far… unfortunately, my free time to investigate this specific issue has gone kapoof. Getting rid of the single point of failure in task distribution has gone on the backlog again.

However, in the meantime, I’ve done some investigation on distributed caches. This was the same app, but a different need – I needed a quick, practical way to distributed data across the cluster, so I tested Infinispan and Hazelcast.

Infinispan uses JGroups behind the scenes to do its clustering; Hazelcast has its own homegrown set of protocols.

Zookeeper, for all its pluses as a cluster management option, doesn’t have anything to offer for this use case. And since distributed caching came up first as a production requirement, Zookeeper may lose out by default; either of these caches can also be adapted for cluster management and task distribution (in the form of a fully-replicated cache, if nothing else). So if they’re already in the stack, and using them isn’t too much work, that’s likely what we’ll use. It frustrates me to have to make technical decisions this way, but the way that our timelines go, that’s often the way it is.

Why not Terracotta?

Terracotta is often at the top of the list when talking about distributed caches, so why wasn’t it here? Three reasons:

  • This particular client has had bad experiences with Terracotta in the past, and was resistent to installing it here.
  • Terracotta requires a server component, which is more operational overhead. It also makes it more difficult to transparently switch between clustering technologies.
  • Terracotta’s license (at least, last I checked) is sort-of free, but with a stronger-than-normal attribution clause; adding commercial license cost wasn’t possible for this minor release. The attribution clause wasn’t entirely a deal-breaker, but it’s certainly less appealing.

The first reason there is more significant than the other two. Terracotta may still end up as a distributed caching option in the future.

Unfortunately, this isn’t as in-depth or thorough of an investigation as part 1. I’m not going to go back through the requirements and then see how the candidates stack up. Infinispan and Hazelcast, our two main candidates, have a very similar feature list on paper, so it came down to a hands-on investigation instead.

The plan

I already had a generic cache interface that the app was using, so my goal was to write an implementation for both Infinispan and Hazelcast and allow us to switch between the two at startup time. That proved to be a bit of a trick – both have Java configuration options (instead of XML-only configuration) in their most recent versions. Both have some tricks to them.

My goal was to configure either cache with just two properties:


The first obviously tells the app whether to use Hazelcast or Infinispan. The second points to at least some of the other cache members. That’s necessary because multicast – which is the default discovery option for both of these products – isn’t available on EC2, which is one of our possible deployment options. For Hazelcast, the list of nodes doesn’t need to be a complete list; it will use the listed nodes as seeds for a complete list. Infinispan, on the other hand, is more unclear.


I ran into a few issues. For example, in Infinispan, my clever code that parsed the address list, found if one of them applied to the current server, and set the bind address appropriately ran into some snags. Netstat showed the following:

tcp        0      0 ::ffff:  :::*                        LISTEN      

It was binding an IPv6 compatibility address. And my “separate port from address” code isn’t smart enough to handle IPv6 addresses, so this wasn’t going to work. I worked around it (in this case, I hardcoded out IPv6 addresses…that’s not on my roadmap right now) and eventually got both working.

Writing the code that actually used the distributed caches was painless. The implementations of my simple cache interface came together without any problem; both implement both ConcurrentMap and JCache’s Cache interface, so they both fit in easily.

Unfortunately, I was still running into Infinispan issues when a node died and then later rejoined the cluster.

Hazelcast, on the other hand, worked out of the box in this case, so that’s what is going on into testing at the client site.

All in all, Hazelcast surprised me. It seems to be gaining acceptance on the web: anecdotally, posts on Stack Overflow from two years ago tended to say “What’s Hazelcast?”. Last year they would say “Hazelcast? That’s unproven”. This year, when it’s showing up on recommendation lists. Granted: small sample size. But that’s what we have to go on. In my testing, it works out of the box and does what it’s supposed to.

Now, it’s worth noting that adding Hazelcast – or any distributed caching – doesn’t come without a cost. Unfortunately, I don’t have that cost quantified, although I’ll try to update this post with measurements if people are curious. But the time spent writing data to and reading data from distributed cache often slow down the application to the point where two machines with a distributed cache are slower than one machine without1. The break-even point depends on the technology, the workload, and the hardware, and how careful you are with what is written to and read from the distributed cache. The higher the consistency and durability guarantees of the cache, the more painful that cost will be.

It’s probably worth digging into this cost briefly. It might seem at first glance that this shouldn’t be the case: after all, you’re just storing data in memory, and when a machine on the cluster needs some data, it asks other machines before going to the database. As long as getting the data from another local machine is faster than going to the database and the cache hit rate is reasonably high, it seems like it should be faster. Alas, it is not the case.

It doesn’t work that way for a couple of reasons: first, because you don’t have any durability guarantees if, after calling cache.put(), it just lives in that first machine’s local memory. If that machine goes down, the data is lost. So most caches (including Infinispan and Hazelcast) will, by default, block the put() call until that data has also been written out to at least one other machine. Of course, if you don’t need that durability, you have options – you can turn off replication entirely, or let it happen in a write-behind queue, so that there’s only a limited window of time in which the data lives on only one machine.

The cache is also likely to be slower because attempts to be clever about which machine holds each element in the cache. It tries to distribute cached data around the available nodes, as well as offer a quick way to compute which machine owns each piece of data, so that it doesn’t have to broadcast a “who has cache key X?” every time you do a cache.get(X). By default, it does this using a hash function on the cache key. That means that, when you call cache.put(..) on machine A, there’s no guarantee that the data will stay on machine A. So after calling put(), you may have to wait for the data is written to the machine that is declared the owner of that value, based on the hash of the cache key. It’s worth noting that Terracotta doesn’t work this way – data ownership is a function of the node that last used the data. But Infinispan and Hazelcast both do.

Now, those who are used to distributed systems, this will be old hat, but it surprises a lot of people. Two machines is very often slower than one. That’s why folks moving to Hadoop and HBase or other dedicated clustering datastores designed to scale to dozens o hundreds of machines find that they don’t start equaling single-machine performance until they move to several machines. Anecdotally, that is often around 5 machines, although it depends heavily on the workload – how much data is being written to and read from distributed storage vs. other uses of time (CPU, other I/O).

So, the moral of the story: be very careful what’s going in and out of cache. “Transparent caching” cannot be – it has a very significant cost over and above the cost of just writing the blocks of local memory, UNLESS you remove enough consistency guarantees that it can happen less obtrusively after the fact.

In related news, Akka 2.0 will gain the clustering support that’s currently reserved for the commercial offering (“Cloudy Akka”)… I’ll be interested to see what that looks like.


I’ve split the conclusion into a separate post, here.

  1. If you haven’t seen it, check out the “numbers every computer engineer should know” from Jeff Dean’s talk “Building Software Systems at Google and Lessons Learned” at Stanford (slides, video). It has some really useful order-of-magnitude values you can use for back-of-the-envelope estimates of the cost of distributed caching.

Cluster Management and Task Distribution: Zookeeper vs. JGroups

The question

Which technology stack makes more sense for a distributed task-management framework?

The backstory

The target app I’m working on at the moment is a task-management framework: it distributes discrete units of work (tasks) to workers which may be spread across multiple servers.  Right now, it’s a single-master system: we can add as many nodes as we want (at runtime, if we want) but only one can ever serve as the “master” node, and that node can’t fail.  The master needs to keep track of a bunch of tasks, which have a hierarchical structure (tasks can have one parent and 0-n child tasks).  It manages the tasks’ interdependencies, checks on the health of the worker nodes, and redistributes tasks if they time out or if their worker node goes down. A lot of task-management systems put tasks in one or more queues (Resque, beanstalkd, etc.); that’s what we did as well, in our first version.  And this works well; you can model a lot of problem domains this way.  But our task data model is a bit more complex than that, though: tasks can have children, and can define whether those children should be executed in serial or in parallel.  So you can define that task A must execute before task B, but then tasks C through Z can execute in parallel.  When Task B executes, it can decide to split its workload up into 100 smaller tasks, B1 through B100, which all execute in parallel, and would all execute before C starts.  So the tasks end up looking more like a tree than a queue…modeling them as a queue gets a bit convoluted.  You could argue that the best answer would be to re-model our tasks to fit within Resque, beanstalkd, or the like, but at this point that’d actually be more work. And you can’t deny this is fun stuff to think about… So the single point of failure needs to go.  I’m looking at some of the options for instead distributing the data (the tree of tasks, who’s executing what, whether the nodes are healthy, whether a task has timed out) across multiple servers.

The requirements

The task tree should be shared between more than one server, with guaranteed consistency.  I.e if server A is serving as the master node and it goes down, another server needs to be able to pick up and continue, with the exact same view of the current task tree.  We don’t want tasks to accidentally get dropped or executed twice because server B’s version of the task tree wasn’t up-to-date.

Nodes have to be able to come online and participate in the cluster, and then drop off again.  The full list of nodes can’t be hard-coded at startup.

Ideally, we should be able to handle a network partition.  We can do that at the expense of availability: that is, if there are 5 nodes, numbered 1 through 5, and nodes 1 and 2 get separated from 3 through 5, it’s OK for 1 and 2 to realize they’re the minority and stop processing.   The following caveats apply:
  1. We need a way to handle an intentional reduction in servers.  That is, if we spun up 10 extra cloud servers for a few hours to help in peak processing, we’d need a way to tell the cluster that we were going to spin them back down again, so the remaining nodes didn’t think they were a minority segment of a partitioned network.
  2. In case of a catastrophic failure of a majority of servers, I’d like a way to kick the remaining nodes and tell them to process anyway.  But I’m ok if that requires manual intervention… that’s an extreme case.
I’m targeting 100 nodes as a practical maximum.  In production, we’ve used up to 5 nodes.  I don’t expect us to go above 10 in the near future, so I’m adding an order-of-magnitude safety margin on top of that.  Of course, when the task management framework goes open source, 100 nodes will be quite feasible for users with different problem domains and deployment patterns.  In our typical deployment scenarios, customers prefer fewer, larger servers.

 The clustering option has to be able to run on EC2 instances.  It can’t depend on IP multicast.

Deployment simplicity
  As few moving parts as possible, to make integrating and deploying the software as simple as possible. Our app is often deployed by the client on their hardware, so we’d like to make their lives as easy as possible. And as the framework goes open source, of course, it’s going to be more helpful for a more people if it has a low barrier of entry.

Implementation simplicity
 The less code I write, the fewer bugs I introduce.

Event bus
 Nodes can also pass events to other nodes, typically by multicast.  We use JMS for this currently, and we can continue to do so, but it’d be nice to get rid of that particular moving part if possible.

There is not necessarily a requirement to persist all tasks to disk, in order to keep them safe between server restarts.  In our previous version, tasks were kept in a JMS queue, and we ended up turning disk persistence off – in practice, in the case of a failure that requires manual intervention (for example, a JMS broker failure, or the failure of all nodes that had tasks in memory) almost always means that we want to manually restart jobs after the system comes back up – possibly different jobs, possibly a reduced set of tasks to make up time.  We found that we rarely want to start back up exactly where we left off. So, if the solution automatically persists to disk, I’ll probably create a way to bypass that if necessary (a command-line flag that lets us clear the current tasks on startup, perhaps).


Consistency Zookeeper pushes everything to disk – writes are always persisted, for safety. Zookeeper is designed for cluster coordination, and it’s well-thought-out.  It has documented consistency guarantees, lots of documented recipes for things like leader election, and supports subscription of listeners on tree changes.
Elasticity We’d have to pick three servers up front that would serve as Zookeeper nodes as well as normal nodes. Those nodes woulnd’t be elastic, but other nodes could come up and down without a problem.
Partition-tolerance The “live” partition would be defined as “the partition that can still connect with the Zookeeper cluster”.   I have to check on how Zookeeper handles network partitions; I believe it’s just a quorum-based algorithm (if one node is separated from the other two, it stops processing).
Scalability Three Zookeeper nodes could support 100 nodes without a problem, according to anecdotal experience.  At that point, we would have the option of moving the Zookeeper nodes onto dedicated hardware. Having that option is the upside to Zookeeper’s deployment complexity: you can separate them, if there’s a need.
EC2-friendliness Zookeeper uses TCP; it’s been used on EC2 before.
Deployment Simplicity The main downside for Zookeeper is its operational requirements:  two classes of servers (with Zookeeper and without), potentially separate server instances running on Zookeeper-enabled servers, fast disk access required (according to the ZK operations manual, two fast disks – one for logs, the other for … er..something else.)  That’s a significant increase in operational complexity for a product that is distributed to and maintained by clients.  That also means work necessary for us to make it as turnkey as possible.A Zookeeper-based solution would pick three or five nodes to be Zookeeper nodes as well as task nodes.  Ideally, for Zookeeper’s purposes, it’d be three dedicated servers, but that isn’t going to happen in our architecture.  So Zookeeper will have to coexist with the current app on servers where it’s installed.  For deployment simplicity, I’ll probably need to come up with a way to start up Zookeeper automatically when the server comes up on those nodes. Zookeeper used to not play well as an embedded app (according to some reports from the Katta project); that may be fixed now, but if not I may need to put that logic in the bootstrap script.
Implementation Simplicity Zookeeper would provide a single, shared view of the task tree.  The system would be up as long as the majority of Zookeeper nodes remained up; any node could be leader, as long as they could connect to Zookeeper to view and update the shared state of tasks.   Other nodes could potentially go straight to Zookeeper to query the task tree, if we were comfortable with losing that layer of abstraction.  Either way, it would be simple implementation-wise.
Event bus Zookeeper doesn’t provide events…unless you write events directly to Zookeeper.  It could, however, obviate the need for a lot of events.  For example, we wouldn’t need a “task finished” event if all nodes just set watches on tasks…they’d be automatically notified.  Same with hearbeats, node failure, and so on – they’d be taken care of by Zookeeper.
Zookeeper doesn’t list AIX as a supported environment, which is interesting.  We do have to support AIX (unfortunately).


Consistency JGroups is a lot more of a DIY solution, although it supports guaranteed message delivery and cluster management out-of-the-box. With guaranteed message delivery, we could have consistent in-memory views of the task tree without too much effort; we wouldn’t need to write something that also persisted to disk.
Elasticity I have to look into how elastic JGroups can be without IP multicast.  I know it can work without IP multicast; I just don’t know exactly HOW it works. updated 01 March 2011: JGroups has a discovery method called TCPPING, in which you predefine a set of known members that new members will contact for membership lists. It also has a standalone “gossip router” that performs the same purpose. I’ve played with this; it works on EC2 without problems.
Partition-tolerance There are options. JGroups has a merge protocol, but it does require application-specific code to determine what exactly to do when the partitions come back together.
Scalability More speculative, because JGroups has the option for infinite combinations of protocols on its stack.  Worst case, the cluster needs to define two classes of nodes, similar to the Zookeeper implementation: “nodes which can become leaders” and “nodes which can’t.” But JGroups has been deployed with hundreds of nodes.
EC2-friendliness JGroups can be used on EC2 as long as you use a non-multicast discovery method.
Deployment Simplicity JGroups would be baked in; worst case, you need to define a couple properties per server (IP of one or more nodes to connect to, and perhaps something like “is this a temporary node”)
Implementation Simplicity More DIY.  The basic protocol stack is provided, along with clustering, leader election, “state transfer” (to bootstrap a new node with a current picture of the task tree) and the lower level; we’d just have to tie it together and fill in the gaps for application-specific needs.
Event bus JGroups can also give us message passing, taking the place of a simple JMS topic.  I have a lot of questions here, though:  how does this handle nodes dropping?  Does it buffer events until the node comes back on?  If so, how do we handle nodes that permanently drop?


I have a followup post about Hazelcast here, and a brief conclusion here.

On Configuration

I’ve been musing recently on how to scale our configuration system up and out: making it work more dynamically for single nodes, and handle multiple nodes…well, at all.

How it works now

Right now, our configuration is spread across the following places, roughly in order of precedence:
  • System properties (-D properties set on the command line)
  • a user-editable property file, for overrides to default properties.  This is preserved during system upgrades.
  • a “system defaults” property file, user-visible but overwritten during upgrades.
  • a database table:  for a while, we were standardizing on properties in the database.  More on this later…
  • some property files buried inside jars, which hold some database queries, some default values, and so on.
  • log4j.xml
We use a MergedPropertyPlaceholderConfigurer, along with a couple other configuration classes available on github) that merges properties from most of the above locations together, ranks them in order of precedence, and then sets them at startup time using Spring’s standard placeholder syntax (${}). Database properties are loaded from a table using a special Spring property loader. So any property can be set in a higher-precedence location and it will override one set in a lower-precedence location. In practice, new properties tend to get set in the property file. Why? Because database changes require a patch (like a migration in the Rails world) which needs to get migrated to each applicable environment.  Deploying the code to a development or test server then requires both a code update and a database update. In practice, the dependency between the code and particular database patches is a bit of a hassle – certainly far more so than just adding it to a property file which gets deployed along with the code.  A bad motivation for keeping properties in files?  Perhaps… but it is the reality of it. A system that raises the barrier of entry for doing “the right thing” is a bad system.  Which brings us to…

Problems with current system

  1. property files are cumbersome in a distributed environment.  Many of our deployments are single-node, but more and more they’re distributed, and distributed deployments should be our default going forward.
  2. For properties stored in the DB, adding, removing or updating any property requires a DB task, and then a DB refresh, which has the effect of discouraging parameterizing things. You tend to think, “eh, well… I’ll just hard-code this for now…”
  3. Properties are loaded at startup time only – you can’t change property values without changing each property file and then restarting each node.

Requirements for a new system:

I’d like to borrow requirements from
  1. Reloading a configuration should be a simple operation for the operator to trigger.
  2. It should not be possible to load an invalid configuration. If the operator tries to do so, the application should continue running with the old configuration.
  3. When reloading a configuration, the application should smoothly switch from the old configuration to the new configuration, ensuring that it is always operating with a consistent configuration. More precisely, an operational sequence that requires a consistent set of configuration parameters for the entire sequence should complete its sequence with the same set of configuration parameters as were active when the sequence started. – For us, this is actually pretty easy.  Our app depends on a task distribution framework, meaning that work is defined as a series of tasks with defined beginnings and endings.  So, we merely need to load the configuration at the beginning of each discrete unit of work.
  4. The application should provide feedback so that the operator knows what the application is doing. Logging, notification or statistics about configuration reloads should be available.

    …and I’d add:
  5. We should be able to set configurations for all nodes at once (this could mean using the database, or perhaps a command-line tool that sprays configurations out to the various nodes, plus a web service to tell nodes to reload..or something else entirely).
  6. We should be able to view the current configuration for each node easily.
  7. We should be able to share configuration between our app and other related applications, again, this could be database, or a web service that exposes our properties to other applications.

Current thoughts

At the code level, I’m thinking of loading properties at the beginning of each task, using a base class or something built into the framework. Reloading and interrogating the configuration could be via a web service (get_configuration / set_configuration). For requirement 3, the easiest option seems to be to use Configgy as a configuration base. As far as centralized configuration goes, I’m up in the air. Some options:
  • Spraying config files (scp’ing configuration files to each server, which would have to be tied to either an automatic poll of files, or a manual “reload_configuration” web service call)
  • distributing configuration using a web service (node 2 calls get_all_configuration on node 1, and sets its own configuration accordingly) – but it would need to be saved somewhere in case node 2 restarts when node 1 isn’t available. The database is an option, but has development-time issues as noted above.
  • saving all configuration in Zookeeper.
What i’d really like, though, is a configuration system that kept properties in an immutable data structure that kept track of where properties came from – so, I could define the locations properties should come from, and then in the application I could say, “config.getProperty(‘foo’)” and get the value with the highest precedence (whether that’s from an override file, a database table, or whatever). But I could also say “config.getPropertyDetails(‘foo’) ” and get a list that said “property ‘foo’ is set to ‘bar’ by local override, is set to ‘groo’ by the central configuration server, and the ‘moo’ as a fallback default.” Now, why do I want that? Mainly for on-site debugging: “I set the property in this property file, but it’s not working!”

Some related (external) links:

I’m open to ideas, as well… Anyone have best-practices for distributed configuration?