On to Octopress

I’ve just switched the blog over to Octopress. I’ve been using vanilla Jekyll for a while to keep up this blog, and I put a bit of effort into configuring it, building up a set of Bootstrap-based templates and so on. So, why switch?

At the end of the day, this is a utilitarian site. I’m not going to impress anyone with my design chops – I enjoy graphic design in theory, and I’d like to think that I’m not the absolute worst at it, but I do little enough HTML these days that it takes a long time to get things done, for end results that are never what I want them to be. And I’d rather put the time into writing actual content[1]. Meanwhile, my site wasn’t displaying terribly well on phones, and Octopress handles mobile devices well out of the box. And it’s beautiful to boot[2].

So, I’m impressed so far. Thanks Brandon!

  1. Not that, in practice, I’ve been writing actual content.

  2. What a weird phrase.

In Defense of Easy

the setup

I watched Rich Hickey’s talk from Strange Loop called “Simple Made Easy”, which has been making the rounds recently[1]. He’s a great speaker, and always makes interesting points. But something in that talk didn’t sit right with me[2]. While his slide about Development Speed (“emphasizing ease gives early speed…ignoring complexity will slow you down over the long haul”) really struck a chord[3], it felt like he was shortchanging “easy” in favor of “simple”[4]. This, then, is my long, somewhat circuitous exposition on that thought.

the case study

I’m working on software that I’ve been involved with for several years. Like a lot of projects, it’s evolved as the company’s (and their clients’) needs have changed, and it’s been written under schedule pressures and ever-shifting requirements – in other words, it’s like most software, as far as I can tell. I dream some days of VC-funded startups in their Herman Miller chairs building beautiful software from scratch with plenty of cash and plenty of time, but I suspect the grass isn’t really so green over there.

The application has gotten to the age where the layers of rapid development are starting to take their toll: new developers coming onto the project complain that the software is just too complex. By ‘complex’, they mean, “it’s hard to wrap my brain around all of the moving parts.” Some of the symptoms of this version of ‘complexity’ include:

  • it takes a very long time to get new developers up to speed
  • there are a fair number of bugs caused by changes made to one part of the application which had downstream impacts on other parts of the application. Since most developers don’t have the “big picture” in mind, they don’t know they’re breaking things when they make changes.
  • writing new features takes a long time, since the developer needs to know a lot, juggle a lot, and dodge all kinds of pitfalls in order to put something new in place.

When I interviewed developers about where the problems were, I got answers like this:

  • You have to keep too many things in mind at any one time.
  • what different parts of the application are doing, upstream and downstream of the piece you’re looking at right now – how all the pieces fit together.
  • details of the domain: you have to have some familiarity with what the end goal is in order to understand what things are doing.
  • terminology: there’s a lot of terminology that is domain-specific, but it also isn’t entirely consistent throughout the application.
  • the pattern at work in the part of the application you’re looking at: different parts of the application follow different general patterns to do what they do.
  • the version of the code you’re looking at: since there are several clients using different versions of the application, some several years old, even if you’ve gotten your head around a section of code in one codebase, there is no guarantee that the accomplishment will translate to another client.

But, is simplicity the root issue? How do we fix it? And how would we prevent it in the first place?

theoretical aside

Here’s my limited understanding of educational psychology:
We’re built to only handle a small number of things in working memory (7 things, on average – like a phone number[5]). To understand more complicated things, we have to either “swap out” to longer-term memory or synthesize several related things into one well-understood thing.

Swapping out takes a long time, and is tiring…how many of us have been reading through technical documentation and gotten to a point where we said, “Oh, wait… the author explained what that term meant, or what that component did…what was it again?” That’s the effect of running out of working memory and having to look in long-term memory to find something you recently read. But more often than not, that factoid probably hadn’t been committed to long-term memory yet, so we had to go back and re-read part of the document.

Synthesizing is the process whereby you understand a group of related things as a whole, so that you’re able to think of the whole as one unit, rather than the component parts. This takes time: you have to “wear grooves” in your brain in an identifiable pattern. For example, when we see a face, we don’t see individual facial features – ears, a nose, a mouth, and eyes – and then try to juggle all of these features to understand whose face it is. We recognize a face as an identifiable whole. Through many repeated exposures, our brains have synthesized that pattern as “a face”, and deal with it as a single item in our memory. That’s also why elementary school teachers drill math facts – if you have internalized that 8 x 7 = 56, you don’t need to store 8 and 7 in working memory and perform the arithmetic…“8x7=56” is a discrete whole in your brain.

For folks learning a new piece of software, especially a large software system, there are a lot of individual pieces that have not yet been synthesized into understandable, well-understood wholes. And for many large software systems – mine included – getting to that point of familiarity is more difficult than it needs to be.

definitions

Now that I’ve butchered psychology and complexity science, I’ll move on to butcher Rich Hickey’s talk. In case you haven’t watched it, here’s the rough paraphrase of the definitions that he bases his thesis around:
  • “complex” means combining two or more things (“complecting”).
  • “simple” means “concerned with only one thing”.
  • “easy” means “lie near” or “close at hand”. Relative to the speaker.
  • Conversely, “hard” means “(conceptually) distant”.

I think it’s fair to say that Rich felt simplicity is the important metric, while “ease” is superficial.

Meanwhile, it’s worth noting that Merriam-Webster’s definition of “complex” is, in part:

1: a whole made up of complicated or interrelated parts

2c: a group of obviously related units of which the degree and nature of the relationship is imperfectly known

(some unrelated definitions omitted)

complex or hard?

Here’s where things get dicey. It seems clear that making things simpler, in Rich Hickey’s sense, would be a real benefit. In this particular application, there are a lot of places that are combining “what to do” and “how to do it”. There are optimizations that have led to I/O-related code mixed into business logic, for example. And the level of abstraction is generally too low; our business logic has to concern itself with a lot of plumbing around “when” and “where”, instead of focusing on “what”. If each component focused only on the “what” of one portion of the business logic, it would certainly be easier to understand.

But it also seems clear that many of the developers’ complaints – “we can’t keep it all in our heads at once”, “we don’t understand how the pieces fit together”, and so on – aren’t really about “complexity” in Rich Hickey’s sense. They are in Merriam-Webster’s sense: “a group of obviously related units of which the degree and nature of the relationship is *imperfectly known*”. But by Rich Hickey’s definition, being hard to conceptualize is “hard”, not “complex”.

Perhaps there’s a broader way that the “single purpose” definition of simplicity could be applied to the application as a whole that I haven’t thought of. But at the micro level, lots of very distinct pieces with a single concern don’t necessarily make it easier to understand the big picture. In some cases, his suggestions may make that worse: a rules engine, for example.

Having important information (“should we follow path X or Y?”) abstracted out to a rules engine is, in Hickey’s terms, hard: it makes important parts of the application no longer near the code itself. So the code determines what, the rules engine determines why (or is it when? not sure) by, I assume, composing the “whats” into a control flow. And then when you’re trying to debug something, you have to go back and forth between the two. That’s simple, in that the two are separated. But the conceptual “surface area”, if you will, is much larger. You have to hold a lot in your head to understand the behavior of the application.

Perhaps generally, simplicity in this “single-purpose” sense helps greatly when understanding the individual component, but makes understanding the application as a whole harder (using, again, Rich Hickey’s definition of hard). But easy and hard don’t matter, right? They’re relative to the particular developer, they’re a matter of convenience…right?

Perhaps I am reading more into that dichotomy than I should. Casting things in black and white is a perfectly reasonable rhetorical device – especially in a fairly short talk – and that’s what Rich Hickey has done here. Exploring all of the subtleties distracts from the thrust of the talk, and simply doesn’t fit into that format. But I just want to explore what those subtleties are. Because in the case we’re looking at here, “easy” matters: things are hard enough to understand that you have to work on the application for years before you know enough to add new features. And it certainly matters when there are only a couple of developers who can get paged at 3 AM when things go wrong.

There are real advantages of “easy”, where “easy” is defined as “near to our understanding/skill set”. Less conceptual weight means more working memory can be applied to the problem at hand. Less conceptual weight means less effort expended on the tool itself, more effort to expend on the problem at hand. That is inherently pleasant for engineers – it is very frustrating to put a lot of effort into the tool instead of the problem. On the flip side, once a tool is mastered, it is very, very pleasant to be able to whip through a problem efficiently: see Vi vs. Notepad.

The argument is similar to the argument that Scala proponents make[6]: the concepts that are currently very conceptually weighty (so many things to keep in mind!) will eventually be internalized and synthesized, and then it is better, because you can attack your problems better. The question is, is the frustration of the synthesis time (the “learning curve”) worth it?

So perhaps my argument is one of degree. When understanding the application becomes hard enough that the ramp-up time for new developers becomes untenable for the developers themselves – not just the managers that want an easily interchangeable workforce – then steps need to be taken to make things easier.

fix it! fix it!

So, if hard matters, what can we do about it?

In this particular case study, there are several things that would help, and everyone reading this far has thought of several more. In fact, I’d be willing to bet that everyone reading said, “Well, clearly, what you should be doing is ___”, where each reader will fill in that blank a different way. That’s because the world of Software Engineering has come up with all kinds of ways to make large systems easier to deal with. Here are a few:

visualizing the abstractions

Visualizing the various related portions together saves you from having to hold parts in memory as you look at other parts – it’s all available right in front of you. Seeing how things work together also helps you synthesize them into a whole. To get a clear picture by looking at the code, you have to have internalized what each of the components does and how they connect. And since there are far more than 7 components in those networks, they don’t all fit in working memory. So until you’ve internalized and synthesized the components and how they interact, it just feels overwhelming.

Visualizing the flow of data through the system is a similar problem. We have sequences of jobs that read various values from the database and then insert or update other values. There’s no way to visualize at a high level what components set data and what components are reading that data.

When reading through the code, I’m essentially trying to construct a mental picture of how it’s all fitting together. Add in any abstraction layer, like a dependency injection framework, and things get even harder: “what components do” and “how the components fit together” are defined in two different places (Java files and Spring context xml files, for example, for the Spring framework). If I could somehow visualize it all on one screen, it would save me the effort of constructing a mental picture… ideally I would be able to see both what components do (popup or mouseover descriptions?) and how they fit together. That would be a lot less to juggle in working memory.

contracts

In cases where the pieces of the application need to fit together but developers aren’t likely to hold all of those pieces in memory, defined contracts are an age-old solution. The theory is, you don’t have to hold it all in memory; just the component you’re working on, up to and including the contracts it has with neighboring components. How they fulfill their contracts is outside your concern.

Whatever form those contracts take, you need to be able to explicitly specify (in an automatically-verified way) what each component expects from others.

In the software in question, there are contracts in place between components…why those aren’t making things simpler for developers is a subject for another blog post.

consistency

If everything followed the same pattern, you would only have to learn that pattern once. Then hopefully, like a face, every new job you saw would just look like a different but recognizable instance of the same pattern.

Consistency also helps in tooling – if everything fits the same pattern, you could create visualizations that worked for every use.

Of course, you can only be as consistent as your actual patterns of use. In this case study, there’s not yet a single pattern that meets the needs of all parts of the system. Some are iterating over data elements and performing relatively simple calculations on each, while others are constructing huge in-memory graphs of time-phased data and using those for some rather complicated calculations. With some more thought, though…

Another form of consistency is a universal vocabulary: if everything used the same term for the same concept, you’d only have to learn the definition once. Terminology was specifically called out as a barrier to understanding. “Domain-Driven Design” practitioners stress a consistent “ubiquitous language” for this reason. Honestly, I’m not sure how big of an issue this ends up being; it is a huge early barrier for new hires, but one that fades pretty quickly.

comprehensive integration test coverage

This doesn’t really help in understanding the system, but does give developers some level of confidence that, while they don’t understand the whole system, they’ll know if their change has broken it. Depending on the tests, they can potentially serve as a form of documentation. And of course it has other benefits as well – like, having components that are tested.

And of course, the classic

documentation

As a last resort, documenting preconditions and postconditions, defining terms, explaining patterns and paradigms in static documentation can help.

Documentation is a distant last place. At best it can help you internalize some piece of the application, but in practice it will just present data that you then have to hold in working memory while you try to reason about the system. You as the reader then have to go through the effort of reading and reviewing it in order to try to commit it into long-term memory. The more complicated the component or concept is, the less useful documentation is.

Documentation is also out-of-date as soon as it is written, it’s never maintained, has to be discoverable for the person trying to learn about things, and is high-effort for both the writer and reader. I secretly suspect that, if we could track the number of times that any given page on our internal wiki has been really read, the average across all pages would be < 1. We write more documentation than we read.

There are many, many more. Feel free to tell me your “obviously, you should be doing ___!” in the comments.

finally, a conclusion

I think Rich Hickey’s division between “simple” and “easy” is useful and instructive…but I think that many of the things that we care about in designing simpler software fall into what he would call “easy”, rather than “simple”. And if you have a nontrivial number of components, with interactions that will not easily fit within a developer’s working memory until they’ve done a significant amount of synthesis… don’t discount easy. It can save you from that 3AM page.

  1. I started writing this in late 2011, when it was in fact “recently”. It’s now old news, and firmly in the canon of tech talks to watch. Since then, he’s given another wildly popular talk at RailsConf called “Simplicity Matters”. I won’t be referencing it here, but it’s worth your time as well.

  2. I’m well aware that, when I disagree with Rich Hickey, the chances of me being right aren’t very good.

  3. Yes, a chord. Which, being two or more interwoven notes, is complex. But that’s ok.

  4. Perhaps someday, I’ll distill it down into a shorter, clearer message. Maybe even a couple dozen powerpoint slides. And maybe then, someone else will point out that the broad brush I used for brevity covered over some other subtlety that they found very important…

  5. The famous “Magical Number 7, Plus or Minus Two”

  6. I like Scala, for the record.

Takes Too Long to Build Your Project? Try Harder.

So, we have a problem at work. Well, more than one, perhaps…but one of them is that builds take too long. I’ve talked to folks I know in different places, and this is a pretty common complaint. But of course, we have it tough. We have a large codebase and a lot of tests. The compile time for the codebase is one issue; I’ll get to that in another blog post. But the test time is worse.

Most of our tests are proper unit tests: they test a single unit of code, mock out all components other than the one under test, and run quite quickly. But we have a significant minority of tests that go against a database, and the development databases live out on EC2 instances in the Amazon clouds. The time to run all pure unit tests across the project is pretty significant, but the time to run all the tests, database-backed ones included, on a developer machine is untenable. Running tests for one module (out of 10 or so) takes half an hour when the code is on a developer’s machine and the development database is on an EC2 instance.

Whoa, hold on there – I can hear some of you now: “Tests that go against the database!? Those aren’t unit tests! Unit tests CAN’T TOUCH ANYTHING! AT ALL!” Ok, fine. They’re not unit tests. Call them integration tests. Call them whatever you want. The fact remains, there is logic contained in the DAO layer, and in the queries themselves (some of the SQL queries are substantially complicated). Somehow or other, that logic needs to be tested.

So, there are a few options I know of:

Mock out the database itself

This would have to be a complete enough fake-database that we could test the queries themselves and catch unbound bind variables, bad column names, and things like that, while retaining full Oracle compatibility. I haven’t seen anything that actually does this, but if you know of it, let me know. In-memory databases like H2 are often used as a substitute for the database while testing, but they don’t work for us because we use a lot of Oracle-specific SQL extensions (partition clauses, SQL hints, and so on). Even Oracle’s own TimesTen database doesn’t support these, unfortunately. If they did work, they’d be a nice option, though.

Cut and run: Mock out as much DB interaction as possible, and put the tests that do hit the DB in a separate group that runs only occasionally.

We tried this first. I created a Maven profile, so that adding -P quick to the Maven command line skips all the database tests. I could have done it the other way around, so that you had to add a profile to run the database tests, but I was afraid we’d never run them. I had nightmares of folks setting up CI jobs for a new branch and forgetting to add the magic “run the database tests, too” parameter, and not seeing database-related errors until it got out onto a server. Yes, we have a QA process after unit tests, but if a bug makes it to that layer, it’s made it too far; it’s holding up other testing. I may still move the tests that hit the database into the integration-test phase, but that feels like a hack; they’re not really integration tests.

There’s also something that makes me uncomfortable when there is a disincentive to writing tests. In an ideal world, we’d be able to say, “Write all the tests you want. Load all the test data you need. We’ll run them.” So the “quick” profile was OK, but I knew we could do better. And it just happened that, elsewhere in the company, some folks were working on another project:

Move the data to the computation.

The team that maintains the development databases created a VirtualBox image of a database server, so that developers could run it locally. No more network lag, and much faster tests (about 2x faster, in early benchmarks). But there were some problems: developer laptops are only so fast, and they have only so much memory. I’d love to say that each developer could have a separate desktop workstation to use as a development database, but unfortunately the budget isn’t there. And while we have 8 GB of RAM on our Macbooks, after running a local version of our app, some development tools, and Firefox, we’re often hitting swap space. So… the dev VM is less than ideal. If I could get 16 GB of RAM on my machine, then I’d be in business. But the ops team balked at that. Something about not being officially supported by Apple. So that leads us to…

Move the computation to the data

Hadoop does it, why can’t I? I think the first few options have a lot going for them, and are probably enough for a lot of environments. But if software gets big enough, there’s always going to be a point where the tests take too long to run locally. And as fast as your development machine is, you can always get a cluster of build servers that are faster. And that’s probably always going to be more economical than giving each developer a beefy PC. So it’s not hard to imagine eventually wanting to do your test builds out on the CI machine.

We use Hudson…er… Jenkins as a CI server already; it builds the project after each checkin. So all we needed was a way to make it build the project BEFORE a checkin – what a lot of people call a “try” server (as in, “try it before you commit it”).

I did a bit of searching, and a lot of smart people have solved this problem already: Etsy, for example, uses Jenkins as a “try” server. Their solution is clean and simple: just send the diff of your code against master, and Jenkins will apply the diff, compile, and build. Nice. Most other solutions seem to take similar approaches: Mozilla, buildbot (used by the Chromium project, among others)… and so on.

Unfortunately, we have a lot of modules and a lot of branches. The branches are unavoidable – we support a number of clients. The number of modules…well, that’s my fault. I split the software into a lot of modules several years ago to try to encourage code reuse between projects, but the cost has been much higher than the benefit. Unfortunately, it’s now costly to merge things back together – but that’s another post for another time.

So, rather than setting up one “try” build per branch per module, doubling the number of CI jobs, I opted to do things differently. I created a “try.sh” script that runs the “compile” and “package” portions of the build, skipping tests, and then sends the compiled classes and test classes over to Jenkins. The Jenkins job just unpackages the classes and runs the enclosed tests.

The Jenkins job, in this scenario, is entirely generic – it knows nothing about the code it’s testing. It doesn’t connect to source control at all. It doesn’t even see the source – we’re sending it compiled class files, and it’s just running the tests.

Is this the right solution for you? Perhaps, if your constraints are similar to ours. Read the Etsy post linked above, and see if that solution seems simpler: it may well be. If you want to set up something like we did, read on.

I’ll say upfront that this method has some disadvantages. The main disadvantage is that it’s slower. The traditional try server does a diff of your local code vs. the mainline, then sends that diff up to the server and applies it, then runs the build and tests up on the server. So all that goes over the wire is your diff, and all your machine has to do is diff the code – all building is done on the CI server.
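For comparison, here is a minimal sketch of that diff-based flow, assuming git as the source control system. The function names are hypothetical, and this is not Etsy’s actual tooling; it only illustrates the “diff locally, apply on the server” idea. Note that a plain git diff covers tracked files only, so brand-new files would have to be added first.

```shell
#!/bin/sh
# Sketch of a diff-based "try" flow. Both function names are hypothetical.

# Developer side: write every change to tracked files, relative to a base
# ref (for example origin/master), into a patch file. Untracked files are
# not included unless they have been staged with "git add".
make_try_patch() {
    base_ref=$1
    patch_file=$2
    git diff --binary "$base_ref" > "$patch_file"
}

# CI side: inside a clean checkout of the same base ref, first check that
# the patch applies cleanly, then apply it. The normal build and test run
# would follow this step.
apply_try_patch() {
    patch_file=$1
    git apply --check "$patch_file" && git apply "$patch_file"
}
```

On the developer machine this would be something like make_try_patch origin/master /tmp/try.patch, followed by whatever upload mechanism the CI server accepts; the server side then calls apply_try_patch on the uploaded file before building.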

My approach, instead, does the building on your local machine, and sends the artifacts to the CI server. This is a compromise: your machine may or may not build faster than the CI server (probably not) but the artifacts are likely a larger upload. So, if in doubt, try it out and see if it works for you.

Setting it up is easy. Since we use Maven, we just have to make sure each of our Maven builds creates a test-jar:

    <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <addDefaultSpecificationEntries>true</addDefaultSpecificationEntries>
                            <packageName>${project.groupId}</packageName>
                        </manifest>
                        <manifestEntries>
                            <url>${project.url}</url>
                            <Implementation-Title>${project.name}</Implementation-Title>
                            <Implementation-Version>${project.version}-${buildNumber}</Implementation-Version>
                            <Implementation-Vendor-Id>${project.groupId}</Implementation-Vendor-Id>
                            <Implementation-Vendor>${project.organization.name}</Implementation-Vendor>
                        </manifestEntries>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>jar</id>
                        <phase>package</phase>
                        <goals>
                            <goal>jar</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-jar</id>
                        <phase>test</phase>
                        <goals>
                            <goal>test-jar</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>source-jar</id>
                        <phase>test</phase>
                        <goals>
                            <goal>jar</goal>
                        </goals>
                        <configuration>
                            <classesDirectory>src/main/java</classesDirectory>
                            <classifier>sources</classifier>
                        </configuration>
                    </execution>
                    <execution>
                        <id>test-source-jar</id>
                        <phase>test</phase>
                        <goals>
                            <goal>jar</goal>
                        </goals>
                        <configuration>
                            <classesDirectory>src/test/java</classesDirectory>
                            <classifier>tests-sources</classifier>
                        </configuration>
                    </execution>
                </executions>
    </plugin>

That snippet is in a parent pom, so that’s taken care of for all of our projects.

The try.sh script looks like this, currently. It could be a lot prettier, but this was the relatively quick-n’-dirty experiment:

    #!/bin/bash
    
    ###############################################################################
    # try.sh
    # 
    # The idea behind this script is to build and package up your project locally, 
    # then send the code (the classes and the test classes) to the Jenkins server
    # to be executed there. 
    # This relies on the build being configured to generate a "test-jar" artifact
    # It has some limitations: 
    #   * You're not going to have exactly the same classpath (yet)
    #   * It can't handle multi-module projects (yet)
    #   * You can't add Maven profiles to the build (yet)
    ###############################################################################
    
    set -u   # fail on uninitialized variable
    set -e   # fail if any command exits with a non-zero status
    
    # Find our local directory, so we can locate the wait-and-notify.sh script later
    DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
    
    JENKINS_HOST=your_build_server_here
    JENKINS_HTTP_PORT=7700
    # This is set in Jenkins' configuration; you probably have to configure Jenkins to have a static port. By default it changes every restart.
    JENKINS_SSH_PORT=54512
    
    # clear out old builds, so we don't upload the wrong thing
    rm -f target/try-$USER-*
    
    mvn -Pdeva -DskipTests=true "$@" package
    
    _date=`date -u +%F.%H%M%S%s`
    _filename=try-$USER-try-$_date.tar.gz
    _filepath=target/$_filename
    
    # This would be the simple version: 
    # tar czvf $_filepath pom.xml target/*.jar  
    # ...but that only handles single-module projects.  This handles multi-module projects: 
    find . -name "pom.xml" -or -name "*.jar" | xargs tar -czvf $_filepath
    
    echo "created try package at $_filepath"
    echo "now uploading to Jenkins..."

    $DIR/wait-and-notify.sh $JENKINS_HOST $JENKINS_SSH_PORT $JENKINS_HTTP_PORT $_filename $_filepath $USER &
    #Backgrounded; we don't want to wait.  The script will notify on completion using Growl.
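One caveat about the find-and-xargs packaging line in the script above: xargs splits file names on whitespace, and with a long enough list it may run tar more than once, with each run overwriting the archive. A slightly more defensive sketch follows. The function name is hypothetical, and it assumes a find that supports -print0 and a tar that supports --null with -T (GNU tar and bsdtar both do).

```shell
#!/bin/sh
# Hypothetical helper: bundle every pom.xml and *.jar under the current
# directory into a single gzipped tarball at the path given as $1.
make_try_package() {
    out=$1
    # -print0 emits NUL-separated names, so paths containing spaces survive.
    # tar reads the whole list from stdin (-T -) in a single invocation,
    # so the archive is never overwritten by a second tar run.
    find . \( -name 'pom.xml' -o -name '*.jar' \) -print0 |
        tar -czf "$out" --null -T -
}
```

In try.sh this would stand in for the find line, called as make_try_package "$_filepath".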

The “wait-and-notify.sh” script just looks like this:

    #!/bin/sh 

    set -e 
    set -u 
    
    HOST=$1
    SSH_PORT=$2
    HTTP_PORT=$3
    FILENAME=$4
    FILEPATH=$5
    USERNAME=$6
    
    scp "$FILEPATH" "$HOST:/tmp/$FILENAME"
    # This is also an option, but jenkins-cli doesn't handle an encrypted private key, and who leaves their private key unencrypted?
    #java -jar jenkins-cli.jar -i ~/.ssh/id_dsa -s $1/job/try-build  build try-build -p attempt.tar.gz=$2 -p username=$3 -s  
    # This would be an alternative to using private keys for jenkins-cli.jar, but you'd need to enter a username and password 
    # --username `git config user.name` --password (something entered at the command line)
    
    echo "Starting the job on Jenkins. See http://$HOST:$HTTP_PORT/job/try-build for status."
    
    # Did the build succeed?
    # The -s flag makes the Jenkins CLI wait for the build and exit with its status.
    # Running ssh as the "if" condition tests that status directly (rather than the
    # status of the echo above) and keeps "set -e" from aborting before we can notify.
    if ssh "$HOST" -p "$SSH_PORT" build try-build -p filename="/tmp/$FILENAME" -p username="$USERNAME" -s
    then
      growlnotify "Try completed successfully"
    else  
      growlnotify "Try failed!"
    fi

The Jenkins job is a basic build with two parameters, one called “username” and one called “filename”. The build runs this script as its first step:

    echo "Building try number ${BUILD_NUMBER} for user ${username}"
    
    # This shouldn't be necessary -- Jenkins should keep each build separate -- but it seems that it wasn't the case. So we'll keep each build in a subdirectory by build number, which is easy enough. 
    _dirname=${BUILD_NUMBER} 
    mkdir $_dirname
    tar  xzvf ${filename} -C ${_dirname}
    cd ${_dirname}
    
    # unzip each of our -tests.jar files into a test-classes directory
    find . -name "*-tests.jar" -execdir unzip -o {} -d test-classes \;
    
    # unzip the normal jar files into a classes directory
    find . -name "*SNAPSHOT.jar" -execdir unzip -o {} -d classes \;

It then invokes a top-level Maven target: surefire:test -fae. This runs the actual Maven tests, but without any build step (since we’re invoking the surefire:test goal explicitly and not the “test” phase). -fae tells it to fail at the end instead of as soon as it sees a test failure; it’s entirely optional.

There’s a lot of potential improvement. For example, right now it’s hard to tell just by looking at the Jenkins build queue whose try is building. We could fix that within the build by using the description setter plugin. I haven’t tried that yet.

Try it out, let me know what you think. Or, if you have a better idea of how to get around long database-bound tests, I’m all ears.

Greasemonkey Is Underrated

I’ve heard about Greasemonkey a few times over the past few years. I remember reading a few people wax eloquent about it, mainly how great it is for changing the layout of web pages (like a more convenient YouTube page, or even changing Google News back to its old layout). And I remember the big security kerfuffle a number of years ago (wow, 2005 – has it been that long?). But I’d never had occasion to use it.

But when I started browsing real estate listings in earnest, I got my excuse. I like Zillow and Trulia, but the realtor I am working with has a website of his own that he prefers – clients can save listings as “favorites” and “possibilities”, make notes, and he can make notes or suggestions in return…which makes good sense. But the site is no Zillow or Trulia: it displays far less data, it’s a bit slow, and it’s fixed at 10 results per page. And when there are over 100 listings in my price range to sort through, every little bit helps.

Greasemonkey can’t help the site speed, but I’ve made some inroads on how much data it displays. I hacked up a little Greasemonkey script that runs in Firefox every time I load the realtor’s search page. The script just goes through each listing and fires off a query to the Zillow API, returning Zillow’s value estimate, the historical min and max value of the house, and a direct link to the listing’s page on Zillow. I may add a couple other things, like tax value and last sold date (is it a flip?). I was excited about adding some neighborhood statistics, too, but unfortunately Zillow doesn’t have neighborhood data for my area.

Up next: look into Trulia’s API (I haven’t checked if they have neighborhood information) and maybe Walkscore, Panoramio and Flickr (for neighborhood pictures). After that, some Google Map hacks…but that’s a bit more involved.

If anyone else is considering pimping their webpages (I highly recommend it), start with Mark Pilgrim’s Greasemonkey Hacks. I heard some disparaging comments about the docs for Greasemonkey, but that page is great. Several of the other docs on Greasespot’s wiki are worth reading as well. Those of us who are only passable JavaScript programmers can be confused by Greasemonkey’s weird sandbox environment, so the documentation is necessary (objects in Greasemonkey are all wrapped with XPCNativeWrappers for security, in response to the security issues from 2005…). I don’t know if it’s mentioned in the documentation, but the newest Greasemonkey version has a “create a new script” menu action that makes getting started dead easy.

As an aside, I have to say that after taking Stanford’s online Machine Learning class this past fall, I was excited about doing some data science to help in the real estate search (“how does the addition of light rail affect nearby housing prices?”). Predicting home prices was actually one of the driving examples through the class. But I’ve been disappointed in how difficult it is to get access to large sets of real estate data in the US. I guess I naïvely thought that since tax valuations and home sales are public record, there’d be a way to access a sizeable dataset. If it’s out there, I haven’t found it. And the listings themselves are almost all closed, living inside each area’s version of the MLS. The closest thing I’ve found are the APIs offered by Zillow and Trulia, but at least Zillow’s terms of service prevent you from storing their service responses locally, and they don’t return MLS results (i.e. most results added by real estate agents) in their search results via the API…you have to ask for the properties individually.

So I’ll have to play around with big data in other ways…

Cluster Management and Task Distribution, Part 3

Summary

This is a followup to my Zookeeper vs. Jgroups post, and the Hazelcast addendum.

I ended up needing caching before I needed cluster management, so after evaluating a few options I implemented a caching facade that could switch between Hazelcast and Infinispan (which is based on Jgroups). I’ve been really impressed with Hazelcast so far – no major problems. We did hit a bug with 1.9.2, but found that a patch release with a fix was already out.

It also has a convenient cluster service built in, so for our cluster management needs, it’s likely to win for us, provided we don’t run across any more problems in testing. The interfaces are easy and intuitive, and I don’t really have anything to complain about – except that there aren’t many heavy hitters that are openly using it.

Conclusion

For our particular set of requirements, Hazelcast appears to be the winner:

  • it provides several things we need (distributed data structures, cache, cluster management)
  • it’s easy to use
  • it’s easy to hide behind a set of facades so that we can switch it out if necessary
  • it’s very simple to configure for different client scenarios (it works on EC2, for example)
  • it’s worked well in testing
But if our requirements were different, I think Zookeeper would have been my choice. Specifically:
  • If I were coordinating more than a handful of servers – say, 50 or more – I’d go with Zookeeper. It was specifically made for coordinating large numbers of servers reliably. And it’s now the de-facto standard for that purpose: Facebook, Twitter, LinkedIn, and a lot of others are all using it for node management across large numbers of nodes. At that scale I don’t want to roll my own if I don’t have to.
  • If I owned installation and maintenance. The deployment complications of Zookeeper are one of the main downsides for our purpose; it’s another server to install and manage, which raises the operational footprint of our app. If we were running the app only within our own datacenter, no big deal – there are Puppet or Chef recipes to install it, and even pre-built EC2 images. That’s a very manageable hurdle. But we’re handing the app over to clients to install and manage, and Zookeeper adds more hardware requirements and an installation process that is good for at least a few more pages in the install guide. Hazelcast, on the other hand, adds just one line to a property file (cluster.nodes=x,y,z). Be kind to your ops teams.
There are probably also cases where JGroups is the right answer. Its main advantage is its configurability – you can put together a nearly infinite combination of protocol stacks to suit various purposes. Check out the comments of my first post, for example, for sample code to handle node discovery in a non-multicast (e.g. EC2) environment. But if all you need is cluster management and some pretty typical services on top of that, there are simpler options. I’m going to be honest – after looking into Zookeeper closely, I’m looking for excuses to use it. But I don’t think that will happen right away.

Cluster Management and Task Distribution Part 2: Caching

The question

Which technology stack makes more sense for a distributed caching framework?

The backstory

I’ve gotten a fair amount of hits on this post comparing JGroups and Zookeeper for task distribution and cluster management. I wanted to write an update of what I’ve investigated so far… unfortunately, my free time to investigate this specific issue has gone kapoof. Getting rid of the single point of failure in task distribution has gone on the backlog again.

However, in the meantime, I’ve done some investigation on distributed caches. This was the same app, but a different need – I needed a quick, practical way to distribute data across the cluster, so I tested Infinispan and Hazelcast.

Infinispan uses JGroups behind the scenes to do its clustering; Hazelcast has its own homegrown set of protocols.

Zookeeper, for all its pluses as a cluster management option, doesn’t have anything to offer for this use case. And since distributed caching came up first as a production requirement, Zookeeper may lose out by default; either of these caches can also be adapted for cluster management and task distribution (in the form of a fully-replicated cache, if nothing else). So if they’re already in the stack, and using them isn’t too much work, that’s likely what we’ll use. It frustrates me to have to make technical decisions this way, but the way that our timelines go, that’s often the way it is.

Why not Terracotta?

Terracotta is often at the top of the list when talking about distributed caches, so why wasn’t it here? Three reasons:

  • This particular client has had bad experiences with Terracotta in the past, and was resistant to installing it here.
  • Terracotta requires a server component, which is more operational overhead. It also makes it more difficult to transparently switch between clustering technologies.
  • Terracotta’s license (at least, last I checked) is sort-of free, but with a stronger-than-normal attribution clause; adding commercial license cost wasn’t possible for this minor release. The attribution clause wasn’t entirely a deal-breaker, but it’s certainly less appealing.

The first reason there is more significant than the other two. Terracotta may still end up as a distributed caching option in the future.

Unfortunately, this isn’t as in-depth or thorough of an investigation as part 1. I’m not going to go back through the requirements and then see how the candidates stack up. Infinispan and Hazelcast, our two main candidates, have a very similar feature list on paper, so it came down to a hands-on investigation instead.

The plan

I already had a generic cache interface that the app was using, so my goal was to write an implementation for both Infinispan and Hazelcast and allow us to switch between the two at startup time. That proved to be a bit of a trick – both have Java configuration options (instead of XML-only configuration) in their most recent versions. Both have some tricks to them.

My goal was to configure either cache with just two properties:

    cache.type=hazelcast
    cache.members=192.168.221.10, 192.168.221.11, 192.168.221.12, 192.168.221.13

The first obviously tells the app whether to use Hazelcast or Infinispan. The second points to at least some of the other cache members. That’s necessary because multicast – which is the default discovery option for both of these products – isn’t available on EC2, which is one of our possible deployment options. For Hazelcast, the list of nodes doesn’t need to be a complete list; it will use the listed nodes as seeds for a complete list. For Infinispan, on the other hand, it’s less clear whether a partial list is enough.
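As a rough sketch of the parsing side of that two-property setup: the class name CacheSettings and its fields are hypothetical, and nothing here calls into the Hazelcast or Infinispan APIs. It only shows how the type and the seed list could be read and validated before handing them to whichever implementation is selected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical holder for the two settings; not part of either cache library.
final class CacheSettings {
    final String type;           // "hazelcast" or "infinispan"
    final List<String> members;  // seed addresses, possibly an incomplete list

    CacheSettings(String type, List<String> members) {
        this.type = type;
        this.members = members;
    }

    static CacheSettings from(Properties props) {
        String type = props.getProperty("cache.type", "").trim().toLowerCase(Locale.ROOT);
        if (!type.equals("hazelcast") && !type.equals("infinispan")) {
            throw new IllegalArgumentException("Unknown cache.type: '" + type + "'");
        }
        // Split the comma-separated member list, tolerating spaces and empty entries.
        List<String> members = new ArrayList<>();
        for (String part : props.getProperty("cache.members", "").split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                members.add(address);
            }
        }
        return new CacheSettings(type, members);
    }
}
```

The caller would then branch on `type` at startup to build the matching cache implementation, passing `members` along as the discovery seeds.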

Implementation

I ran into a few issues. For example, in Infinispan, my clever code that parsed the address list, found if one of them applied to the current server, and set the bind address appropriately ran into some snags. Netstat showed the following:

    tcp        0      0 ::ffff:10.192.171.127:8016  :::*                        LISTEN

It was binding an IPv4-mapped IPv6 address. And my “separate port from address” code isn’t smart enough to handle IPv6 addresses, so this wasn’t going to work. I worked around it (in this case, I hardcoded out IPv6 addresses, since supporting them isn’t on my roadmap right now) and eventually got both working.
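For what it’s worth, one way such a splitter could tolerate IPv6-style strings is to split on the last colon rather than the first. This is only a sketch with a hypothetical HostPort helper, not the code from the app, and it assumes a port is always present (a bare IPv6 address with no port would be misread).

```java
// Hypothetical helper: splits "host:port" on the LAST colon so that hosts
// which themselves contain colons (IPv6 forms) survive intact.
final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
        this.host = host;
        this.port = port;
    }

    static HostPort parse(String text) {
        int lastColon = text.lastIndexOf(':');
        if (lastColon < 0 || lastColon == text.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + text);
        }
        String host = text.substring(0, lastColon);
        int port = Integer.parseInt(text.substring(lastColon + 1));
        // Also accept the bracketed form, e.g. "[::1]:8016".
        if (host.startsWith("[") && host.endsWith("]")) {
            host = host.substring(1, host.length() - 1);
        }
        return new HostPort(host, port);
    }
}
```

With this, both `192.168.221.10:8016` and the netstat-style `::ffff:10.192.171.127:8016` yield port 8016, with everything before the final colon kept as the host.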

Writing the code that actually used the distributed caches was painless. The implementations of my simple cache interface came together without any problem; both implement both ConcurrentMap and JCache’s Cache interface, so they both fit in easily.

Unfortunately, I was still running into Infinispan issues when a node died and then later rejoined the cluster.

Hazelcast, on the other hand, worked out of the box in this case, so that’s what is going into testing at the client site.

All in all, Hazelcast surprised me. It seems to be gaining acceptance on the web: anecdotally, posts on Stack Overflow from two years ago tended to say “What’s Hazelcast?”. Last year they would say “Hazelcast? That’s unproven”. This year, it’s showing up on recommendation lists. Granted: small sample size. But that’s what we have to go on. In my testing, it works out of the box and does what it’s supposed to.

Now, it’s worth noting that adding Hazelcast – or any distributed caching – doesn’t come without a cost. Unfortunately, I don’t have that cost quantified, although I’ll try to update this post with measurements if people are curious. But the time spent writing data to and reading data from a distributed cache often slows down the application to the point where two machines with a distributed cache are slower than one machine without1. The break-even point depends on the technology, the workload, the hardware, and how careful you are with what is written to and read from the distributed cache. The higher the consistency and durability guarantees of the cache, the more painful that cost will be.

It’s probably worth digging into this cost briefly. It might seem at first glance that this shouldn’t be the case: after all, you’re just storing data in memory, and when a machine on the cluster needs some data, it asks other machines before going to the database. As long as getting the data from another local machine is faster than going to the database and the cache hit rate is reasonably high, it seems like it should be faster. Alas, it is not the case.

It doesn’t work that way for a couple of reasons: first, because you don’t have any durability guarantees if, after calling cache.put(), it just lives in that first machine’s local memory. If that machine goes down, the data is lost. So most caches (including Infinispan and Hazelcast) will, by default, block the put() call until that data has also been written out to at least one other machine. Of course, if you don’t need that durability, you have options – you can turn off replication entirely, or let it happen in a write-behind queue, so that there’s only a limited window of time in which the data lives on only one machine.
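To make the difference between the two put() styles concrete, here is a toy model. It is not how Hazelcast or Infinispan are implemented: ToyReplicatedCache is a made-up class, the “backup machine” is just a second in-memory map, and a single-threaded executor stands in for the write-behind queue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: "backup" stands in for a second machine. No real network
// or cache library is involved.
final class ToyReplicatedCache {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Map<String, String> backup = new ConcurrentHashMap<>();
    private final ExecutorService writeBehind = Executors.newSingleThreadExecutor();

    // Durable style: the caller pays for the backup write before put returns.
    void putSync(String key, String value) {
        local.put(key, value);
        copyToBackup(key, value);
    }

    // Write-behind style: returns right away; the backup copy happens later,
    // so there is a window in which only one "machine" holds the value.
    void putWriteBehind(String key, String value) {
        local.put(key, value);
        writeBehind.execute(() -> copyToBackup(key, value));
    }

    private void copyToBackup(String key, String value) {
        backup.put(key, value); // in a real cluster this would be a network round trip
    }

    String backupValue(String key) {
        return backup.get(key);
    }

    // Stops accepting writes and waits for queued backups; true if all completed in time.
    boolean flush() {
        writeBehind.shutdown();
        try {
            return writeBehind.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In the synchronous version the cost of the round trip lands on every caller of put(); in the write-behind version it is deferred, at the price of a short period where losing the first machine loses the data.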

The cache is also likely to be slower because it attempts to be clever about which machine holds each element in the cache. It tries to distribute cached data around the available nodes, as well as offer a quick way to compute which machine owns each piece of data, so that it doesn’t have to broadcast a “who has cache key X?” every time you do a cache.get(X). By default, it does this using a hash function on the cache key. That means that, when you call cache.put(..) on machine A, there’s no guarantee that the data will stay on machine A. So after calling put(), you may have to wait until the data is written to the machine that is declared the owner of that value, based on the hash of the cache key. It’s worth noting that Terracotta doesn’t work this way – data ownership is a function of the node that last used the data. But Infinispan and Hazelcast both do.
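The ownership idea can be illustrated with a deliberately simplified rule: hash the key and take it modulo the number of nodes. This is an assumption for illustration only; real caches use more elaborate partitioning schemes, and KeyOwnership is a hypothetical name.

```java
import java.util.List;

// Hypothetical, simplified ownership rule. The point is only that every node
// can compute the owner locally instead of broadcasting "who has key X?".
final class KeyOwnership {
    static String ownerOf(String key, List<String> nodes) {
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }
}
```

With nodes A, B and C, the key "a" (hash code 97) maps to index 1, so a put("a", ...) issued on machine A would have its data owned by machine B, which is exactly the extra hop described above.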

Now, for those who are used to distributed systems, this will be old hat, but it surprises a lot of people. Two machines are very often slower than one. That’s why folks moving to Hadoop and HBase or other dedicated clustering datastores designed to scale to dozens or hundreds of machines find that they don’t start equaling single-machine performance until they move to several machines. Anecdotally, that is often around 5 machines, although it depends heavily on the workload – how much data is being written to and read from distributed storage vs. other uses of time (CPU, other I/O).

So, the moral of the story: be very careful what’s going in and out of cache. “Transparent caching” cannot be – it has a very significant cost over and above the cost of just writing the blocks of local memory, UNLESS you remove enough consistency guarantees that it can happen less obtrusively after the fact.

In related news, Akka 2.0 will gain the clustering support that’s currently reserved for the commercial offering (“Cloudy Akka”)… I’ll be interested to see what that looks like.

Conclusion

I’ve split the conclusion into a separate post, here.

  1. If you haven’t seen it, check out the “numbers every computer engineer should know” from Jeff Dean’s talk “Building Software Systems at Google and Lessons Learned” at Stanford (slides, video). It has some really useful order-of-magnitude values you can use for back-of-the-envelope estimates of the cost of distributed caching.

Actors in Concurrency Terms

Intro

There’s a lot of talk on the web about actors. And if you’ve followed very much of it at all – especially the back-and-forth on blogs about pros, cons, and whether actors are great/overhyped/not as useful as XYZ, and so on – then this won’t be new to you. But hopefully it’s interesting for the rest of us. This post was born partly out of my desire to test actors out for myself, and partly out of my desire to play with Akka, an actor framework for Scala and Java.

For a lot of people the reaction to this post will be “well, of course!” But I was curious, so I had to prove it to myself. What I want to do here is describe actors (in this case, Scala actors, but the principle should apply to any implementation – let me know if that’s not the case) in terms that those who are used to semaphore-based concurrency will be familiar with. I’m using the phrase “semaphore-based concurrency” as shorthand for the whole paradigm of concurrency in C, Java, and so on – locks, threads, other things you’ll find described in, say, Java Concurrency in Practice.

Why? Well, because I’ve been reading a lot about actors, and it’s starting to sink in – when faced with a concurrency problem I’m starting to think, “would actors help in this case?” And to answer that question, I need to know exactly how actors compare and contrast with the other concurrency tools at my disposal.

Important disclaimer: nothing here is intended as a knock on actors. It’s an observation on how they should be used.

I’ll head off a couple of likely arguments up front:

“You’re missing the point of actors!” I hopefully cover that pretty thoroughly below – there are a lot of advantages to actors.

Related: “That’s not the kind of thing that actors were designed for!” First of all, actors were originally proposed to replace all control flow, so I question the validity of the statement. But I’ve seen it said on many blog comments, or the related “use the right tool for the job” or “It’s just a tool in your toolbox” or “it’s not a silver bullet”. Those are all well and good; in this post, I’m trying to compare and contrast actors with my go-to tools for concurrent programming, to put some parameters on when to use actors.

An important caveat, related to “not a situation to use actors” – this example (a cache) is not one you would pick to demonstrate actors. It’s synchronous by nature, which is one of the primary things that actor architectures try to avoid. Now, I assume pure actor architectures need caches, and I don’t know how to do a cache that’s purely asynchronous… but being synchronous neutralizes the primary advantage of actors, as I’ll explain later. Comparing a purely asynchronous use-case will have to be a separate post.

That’s a really, critically important caveat to state up front. I’m talking about uses of actors in situations that are not designed, from the ground up, as actor architectures. Spoiler alert: I’m going to conclude that using an actor as a drop-in replacement for standard semaphore-based design performs poorly. I’ll define “standard”, “design” and “poorly” down below, but this isn’t a knock on actors. It’s an observation on how they should be used.

“You’re comparing apples and oranges!” I hate that phrase. Pet peeve. All you’ve just said is, “You’re comparing two different things!” Of course I am. If they were identical, there’d be no point in comparing them.

Everything in this post has been said before, more succinctly and intelligently. If you’ve read a lot about actors and their pros and cons, this will be familiar territory. If you look at this post (especially the comments), you’ll see discussions around a lot of this same territory.

There’s a lot more to actors than I’m getting to here. Actors were originally conceived as an alternative to semaphores, among other things (they were originally conceived as a replacement for all control flow). And they have some advantages vs. semaphores. However… you have to go through the same amount of work to create a scalable actor framework as you do to create a scalable semaphore-based concurrent data structure…but at least in a JVM-based language, your semaphore-based data structures are already there for you. Actor-based data structures are not. So really the message here is – if you’re looking at putting a single actor in place, like this, that’s not the way to go.

a) A lot of people say concurrency is hard. I don’t necessarily think so, but I see the point.

I have a couple of biases coming in: first, I don’t think concurrency is as devilishly hard as some people make it out to be. I’ve certainly created concurrency bugs before, but I’ve caused a lot of other bugs too. I think locks are pretty easy to reason about. My counter-bias is that I like Scala, and I like the idea of Erlang (the OTP especially) and I’m subject to the normal waves of programming trends that say that actors are cool. And this entire post started as me looking for an excuse to work with Akka, an actor library for Scala.

b) A lot of people say actors are easier. It’d be nice to compare the two on similar terms.

Most introductions to actors start out with the argument that concurrency is hard, semaphore-based concurrency is particularly hard, and that actors are an easier way of reasoning about concurrency. In general, I think it’s a true statement that it’s easy to get semaphore-based concurrency wrong – if you move beyond the trivial “lock the whole class” model. If you just slap a lock at the entry to every method, and have them all lock on the same thing – so that there is only one thread executing the class at any point in time – it’s pretty easy. Inefficient? Yes. But easy, and pretty bulletproof. Once you start getting more elegant (see http://www.thinkingparallel.com/2007/07/31/10-ways-to-reduce-lock-contention-in-threaded-programs/ ), errors start creeping in. I want to explore the argument that actor concurrency is easier – but, similarly, move past the trivial “lock the whole world” concurrency model. But in order to compare like-for-like, we need to either compare trivial semaphore

2) First, let’s establish that actors are, in semaphore-based concurrency terms, the equivalent of class-level locking, but with perks. (or, with more suspense: Let’s see how a simple actor-based cache compares with a simple and not-so-simple cache implementations using semaphore-based concurrency)

Let’s see how a simple actor-based cache compares with simple and not-so-simple cache implementations using semaphore-based concurrency. In the next post, I’ll explore how to improve the actor-based cache. Now, the cache is an interesting use case for an actor, because it’s synchronous. Many (most?) of the advantages from an actor-based architecture rely on asynchrony. I don’t think that makes this a bad example, but it is stacked against actors. But caches are necessary. Perhaps as a followup we can attempt an example that is fully asynchronous.

My conclusion up front, if you’re the “tl;dr” type: actors have a lot of advantages. They’re simple to reason about, and they allow for some interesting design patterns. A caller who sends an asynchronous message to an actor returns immediately, even if the actor it is calling is terribly overloaded. The equivalent semaphore-based system using class-level locking would instead block the caller until it acquired the lock, which could be a long time depending on the level of contention. This property of actors is referred to as “liveness”. There are some other potential benefits to actors. Having access to their own inboxes allows for interesting SEDA-style scaling possibilities, for example. And being message-based makes the transition from single-process or single-machine to multi-process and multi-machine somewhat natural, where a traditional app will require a service layer of some sort. Actors impose a “message routing” step which can be a convenient place to put logic around failover, distribution, scaling, and probably other things I’m not thinking of. Many actor frameworks allow for supervisor hierarchies, which offer a fault-tolerance mechanism, and a hot code replacement mechanism for in-place upgrades. The designers of Erlang knew what they were doing.

But the first advantage – they’re simple to reason about in concurrency terms – comes at a cost: they are simple to reason about because they are effectively using class-level locking. The most difficult part of getting traditional concurrency right is knowing what to lock, and locking it for only as long as you need it (and no longer). The actor pattern gets around that by effectively locking everything in the class – only one thing will be executing on that actor at a time, ever – and locking it for the entire duration of the method call. This works out in many situations because of the liveness advantage – yes, the lock is coarse, but callers don’t have to wait to acquire the lock. They just drop the message in the actor’s mailbox and continue on. So if all you’re looking for is simpler concurrency, and you don’t intend on taking advantage of the other benefits in your architecture, you are free to choose between actors and the coarsest of locks, at your preference. At the moment, you’re likely to get more sneers by using coarse-level locking, but that’s a psychological phenomenon of programming trends more than a real reflection on the merits of the pattern.

On a similar note, but more of a no-brainer: if: 1.) all you’re looking for is simpler concurrency, and 2.) you’re not going to take advantage of the other benefits in your architecture, and 3.) there is a ready-made concurrent data structure available to you, that concurrent data structure will beat an actor every time.

That raises a question that probably merits another post: if you’re considering using actors to replace a concurrent data structure, are you thinking about them all wrong? On the one hand, it seems quite consistent with most descriptions of actors I see: they’re to be used to hide mutable state. A map (that’s used as a cache, for example) might be mutable, so to handle concurrent access to it, you place it inside an actor. This is exactly the use case that’s described by , , and . And in this particular example, if the additional benefits of actors (supervisor hierarchies, mailbox introspection, possibly distributed routing …) aren’t immediately useful, then it’s hard to argue for an actor-based approach rather than, say, a ConcurrentHashMap. ConcurrentHashMap is a beautiful data structure – the lock-striping logic is far more sophisticated than anything you’re likely to write, and it’s already written and ready to use. So, all else being equal: use it.

This leads to an interesting psychological phenomenon caused by the cachet of various technologies: The problem is, using this kind of coarse-grained locking feels dirty. Looking at this, you just KNOW that it could be much better. So it’s very hard to leave it like this. And for good reason: with minimal effort, it could perform significantly better under heavy concurrency. But if you implemented the same thing with actors, it’d seem somewhat elegant. Actors are cool. Even though, from a concurrency standpoint, they are equivalent.

Our Cache Example

Here are some steps you might go through when designing a cache.

First, you’d set out your interface. In this case we’ll pick a very simple interface – you can get a value, set a value, remove a value, and get a list of all keys. We’re using the last one mainly for logging in our test class.
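A sketch of what such an interface could look like in Java. The name SimpleCache and the exact signatures are assumptions; the post only pins down the four operations.

```java
import java.util.Set;

// Hypothetical name and signatures; only the four operations come from the text.
interface SimpleCache<K, V> {
    V get(K key);               // returns null when the key is absent
    void put(K key, V value);   // "set a value"
    void remove(K key);
    Set<K> keys();              // used mainly for logging in the test class
}
```

Keeping the interface this small is what makes it cheap to swap implementations underneath it.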

The first example you might write looks like this:
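A minimal sketch of what such a first version could look like, assuming a plain HashMap guarded by a single ReentrantLock. The class body is illustrative rather than the code that was actually benchmarked, and it exposes the four cache operations directly instead of implementing a shared interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: one lock guards every operation, so at most one thread is
// ever inside the class at a time.
final class SimpleLockCache<K, V> {
    private final Lock lock = new ReentrantLock();
    private final Map<K, V> map = new HashMap<>();

    public V get(K key) {
        lock.lock();
        try {
            return map.get(key);
        } finally {
            lock.unlock();
        }
    }

    public void put(K key, V value) {
        lock.lock();
        try {
            map.put(key, value);
        } finally {
            lock.unlock();
        }
    }

    public void remove(K key) {
        lock.lock();
        try {
            map.remove(key);
        } finally {
            lock.unlock();
        }
    }

    public Set<K> keys() {
        lock.lock();
        try {
            // Copy under the lock so callers never iterate the live key set.
            return new HashSet<>(map.keySet());
        } finally {
            lock.unlock();
        }
    }
}
```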

This is what I’ll call “class-level locking” or “lock the world” – essentially, the class has one lock, and any method call of consequence has to acquire that lock.

That’s really not bad. But if you’d spent some time looking at java.util.concurrent, you might see some other options that were slightly more sophisticated. Perhaps you’d notice that half of your methods were doing writes, and half were just doing reads – and reads don’t really need to block other reads. So you replace the lock with a ReadWriteLock.
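A sketch of the read/write-lock variant, assuming a HashMap guarded by a ReentrantReadWriteLock. The class name is hypothetical and the body is illustrative, not the benchmarked code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: many readers may hold the read lock together; a writer
// needs the write lock and excludes everyone else.
final class ReadWriteLockCache<K, V> {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<K, V> map = new HashMap<>();

    public V get(K key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(K key, V value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void remove(K key) {
        lock.writeLock().lock();
        try {
            map.remove(key);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Set<K> keys() {
        lock.readLock().lock();
        try {
            return new HashSet<>(map.keySet());
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Whether this beats a single lock depends on the read/write mix: the more the workload leans toward reads, the more the shared read lock pays off.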

Or perhaps you realize that you can just synchronize access to the data structure itself:
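One way this could look, assuming the synchronization is delegated to Collections.synchronizedMap. The class name is hypothetical and this is a sketch rather than the original listing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: the wrapper map synchronizes each individual call on itself.
final class SynchronizedMapCache<K, V> {
    private final Map<K, V> map = Collections.synchronizedMap(new HashMap<K, V>());

    public V get(K key) {
        return map.get(key);
    }

    public void put(K key, V value) {
        map.put(key, value);
    }

    public void remove(K key) {
        map.remove(key);
    }

    public Set<K> keys() {
        // Iterating a synchronized map's views requires holding the map's
        // monitor manually, so copy the keys inside a synchronized block.
        synchronized (map) {
            return new HashSet<>(map.keySet());
        }
    }
}
```

The cache class itself no longer carries a lock; the cost is that every call still serializes on the one monitor inside the wrapper.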

And finally, you might replace this synchronized HashMap (which is, itself, rather coarse in its Java implementation) with a ConcurrentHashMap. This is a really sophisticated concurrent data structure, which divides the map into a series of stripes which each have their own lock, so a “put” or “get” only locks a portion of the map at a time. That implementation might look like this:
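A sketch of that last step, with a hypothetical class name. Note one behavioral difference from HashMap that any real swap has to account for: ConcurrentHashMap rejects null keys and null values.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: all of the concurrency control lives inside the map.
final class ConcurrentMapCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();

    public V get(K key) {
        return map.get(key);
    }

    public void put(K key, V value) {
        map.put(key, value); // throws NullPointerException for null key or value
    }

    public void remove(K key) {
        map.remove(key);
    }

    public Set<K> keys() {
        // The key set view is weakly consistent; copying gives callers a stable snapshot.
        return new HashSet<>(map.keySet());
    }
}
```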

Future: Infinispan (claims to have a “mostly lock-free core”, which leads me to believe it uses spin-locking. But I know that in distributed operation it uses striped locks by default, and can be configured for per-key locks). I’d like to plug in some other caches like Hazelcast, as well.

b) Now, here’s how they perform under load. The test class is here, if you’re curious. Basically, it’s going to run one test thread per CPU on the test machine, run the SimpleLockCache once first for warm-up, then run a whole mess of iterations of “get” and “put”, at a ratio of 2 to 1.
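The shape of such a load test can be sketched as follows. This is not the test class referred to above: CacheLoadTest is a hypothetical name, it drives a plain java.util.Map instead of the cache interface, it picks keys at random from a range of 1000, and it does no warm-up, so its timings are only a rough indication.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;

final class CacheLoadTest {
    // Runs `threads` workers, each doing `iterations` rounds of two gets and one put.
    // Returns elapsed wall-clock nanoseconds. The map must be safe for concurrent use.
    static long run(Map<Integer, Integer> cache, int threads, int iterations) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // line all workers up before the clock starts
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                ThreadLocalRandom random = ThreadLocalRandom.current();
                for (int i = 0; i < iterations; i++) {
                    cache.get(random.nextInt(1000));
                    cache.get(random.nextInt(1000));
                    cache.put(random.nextInt(1000), i);
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - begin;
    }
}
```

A call such as `CacheLoadTest.run(new ConcurrentHashMap<>(), Runtime.getRuntime().availableProcessors(), 1_000_000)` would match the one-thread-per-CPU setup.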

i) 8 threads, Macbook Pro ii) X threads, Amazon EC2 instance.

c) Now, here’s an actor implementation

Now, here’s the simplest possible actor implementation. This is borrowed from the Lift framework (here), and adapted for this interface. It’s simply using an actor to guard access to a HashMap:
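To show the shape of the idea without depending on Lift, Scala actors or Akka, here is a hand-rolled Java stand-in. It is emphatically not the Lift code: a single-threaded executor plays the actor, its task queue plays the mailbox, and ActorStyleCache is a hypothetical name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ActorStyleCache<K, V> {
    // Only the mailbox thread ever touches this map, so it needs no lock.
    private final Map<K, V> map = new HashMap<>();
    // The executor's queue plays the role of the actor's mailbox.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    public V get(K key) {
        return ask(() -> map.get(key));
    }

    public void put(K key, V value) {
        ask(() -> map.put(key, value));
    }

    public void remove(K key) {
        ask(() -> map.remove(key));
    }

    public Set<K> keys() {
        return ask(() -> new HashSet<K>(map.keySet()));
    }

    public void shutdown() {
        mailbox.shutdown();
    }

    // Synchronous "ask": enqueue a message, then block until it has been processed.
    private <T> T ask(Callable<T> message) {
        try {
            return mailbox.submit(message).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Every operation funnels through one thread, one message at a time, which is why it behaves like a single class-wide lock from the caller’s point of view.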

d) Here’s how it performs under load i) 8 threads, Macbook Pro ii) X threads, Amazon EC2 instance.

Perhaps unsurprisingly, it does a bit worse than our class-level locking example. And that’s because, of course, it is class-level locking. It’s synchronizing the whole class, such that only one caller will ever be able to execute on the thread at a time. Whereas the other semaphore-based examples allowed for multiple threads to execute concurrently in some situations, that can never happen in the actor-based example. Never. It will only ever process one message at a time.

In semaphore-based concurrency, class-level locking is considered crude. And it is. Since the class-level locking in an actor is hidden, it doesn’t come across as crude – at least, not to those of us who don’t use actors much. Perhaps because we just haven’t been trained yet to look at this and see “Whoa, bottleneck!”

I wonder if, when it becomes normal to look at this and say, “Well, obviously, you’ll want to spawn child actors to offer multithreaded access to this data structure, otherwise it’s a bottleneck…” that we’ll start saying, “man, actor concurrency is hard…there must be a better way”. Perhaps all “next big thing” panaceas are just the result of us not yet seeing the complications…eventually, we get used to the good things but notice the pain points more and more… and this is starting to sound like an old marriage, so I’ll stop.

Now, again – this is synchronous. Ideal actor architectures are asynchronous. But that doesn’t matter here – whether the calling thread was dutifully waiting for the message to return, or whether it was going about other business, it will still take the same amount of time to process the message, because it’s a bottleneck. And it’s silly to think that you’re not, at some point, going to need the results from that call somehow. Bottlenecks matter just as much in actor frameworks. So you need to deal with the bottleneck.

The one advantage that actor concurrency has (that I’m aware of) is “liveness”. That is, if this were an asynchronous system, and if the cache was very backed up (lots of attempted concurrent accesses) then a call to the backed-up actor would return immediately, no matter how many other concurrently attempted accesses there were. That’s because the call to the actor is really just placing a message in its inbox; placing that message can occur quickly and return. The semaphore-based alternative would block, since the actual execution was happening in the caller’s thread. The actor would implicitly be executed in a different thread, so you gain that advantage. And in a highly-contested, asynchronous system, that could be significant. In a synchronous environment (which a cache will always be), there’s no difference – the caller has to wait for the result, so the liveness advantage is lost.

h. The way you deal with bottlenecks, though, is different in an actor-based system.
3.) So, basic actor concurrency looks like class-level locking in a semaphore-based system, but with benefits. To reiterate, it has some advantages:
a. Advantages….
4.) Next: I’ll stumble through some attempts to improve the cache, based on the same principles that worked so well in the semaphore-based version.
a. If anyone has an already-written one out there, let me know; I’ve looked and haven’t found one.

Cluster Management and Task Distribution: Zookeeper vs. JGroups

The question

Which technology stack makes more sense for a distributed task-management framework?

The backstory

The target app I’m working on at the moment is a task-management framework: it distributes discrete units of work (tasks) to workers which may be spread across multiple servers. Right now, it’s a single-master system: we can add as many nodes as we want (at runtime, if we want) but only one can ever serve as the “master” node, and that node can’t fail. The master needs to keep track of a bunch of tasks, which have a hierarchical structure (tasks can have one parent and 0-n child tasks). It manages the tasks’ interdependencies, checks on the health of the worker nodes, and redistributes tasks if they time out or if their worker node goes down.

A lot of task-management systems put tasks in one or more queues (Resque, beanstalkd, etc.); that’s what we did as well, in our first version. And this works well; you can model a lot of problem domains this way. But our task data model is a bit more complex than that: tasks can have children, and can define whether those children should be executed in serial or in parallel. So you can define that task A must execute before task B, but then tasks C through Z can execute in parallel. When task B executes, it can decide to split its workload up into 100 smaller tasks, B1 through B100, which all execute in parallel, and would all execute before C starts. So the tasks end up looking more like a tree than a queue… modeling them as a queue gets a bit convoluted. You could argue that the best answer would be to re-model our tasks to fit within Resque, beanstalkd, or the like, but at this point that’d actually be more work. And you can’t deny this is fun stuff to think about…

So the single point of failure needs to go. I’m looking at some of the options for instead distributing the data (the tree of tasks, who’s executing what, whether the nodes are healthy, whether a task has timed out) across multiple servers.
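To make the tree-shaped task model concrete, here is a small sketch. The names (Task, ChildMode, runnableChildren) are invented for illustration and are not the framework's actual classes; the sketch only covers the serial-versus-parallel ordering of children.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical task node: one parent, 0-n children, and a rule for
// whether the children run one after another or all at once.
class Task {
    enum ChildMode { SERIAL, PARALLEL }

    private final String name;
    private final ChildMode childMode;
    private final List<Task> children = new ArrayList<>();
    private Task parent;

    Task(String name, ChildMode childMode) {
        this.name = name;
        this.childMode = childMode;
    }

    String name() {
        return name;
    }

    Task parent() {
        return parent;
    }

    // Attaches a child and returns this task, so calls can be chained.
    Task addChild(Task child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Children that may start now, given the names of the finished tasks:
    // every unfinished child in PARALLEL mode, only the first one in SERIAL mode.
    List<Task> runnableChildren(Set<String> finished) {
        List<Task> ready = new ArrayList<>();
        for (Task child : children) {
            if (finished.contains(child.name)) {
                continue;
            }
            ready.add(child);
            if (childMode == ChildMode.SERIAL) {
                break;
            }
        }
        return ready;
    }
}
```

In the example from the text, the root would be a SERIAL task whose children are A, B, and a PARALLEL group holding C through Z, with B itself a PARALLEL task that gains B1 through B100 at runtime.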

The requirements

Consistency
The task tree should be shared between more than one server, with guaranteed consistency. That is, if server A is serving as the master node and it goes down, another server needs to be able to pick up and continue, with the exact same view of the current task tree. We don’t want tasks to accidentally get dropped or executed twice because server B’s version of the task tree wasn’t up-to-date.

Elasticity
Nodes have to be able to come online and participate in the cluster, and then drop off again.  The full list of nodes can’t be hard-coded at startup.

Partition-tolerance
Ideally, we should be able to handle a network partition.  We can do that at the expense of availability: that is, if there are 5 nodes, numbered 1 through 5, and nodes 1 and 2 get separated from 3 through 5, it’s OK for 1 and 2 to realize they’re the minority and stop processing.   The following caveats apply:
  1. We need a way to handle an intentional reduction in servers.  That is, if we spun up 10 extra cloud servers for a few hours to help in peak processing, we’d need a way to tell the cluster that we were going to spin them back down again, so the remaining nodes didn’t think they were a minority segment of a partitioned network.
  2. In case of a catastrophic failure of a majority of servers, I’d like a way to kick the remaining nodes and tell them to process anyway.  But I’m ok if that requires manual intervention… that’s an extreme case.
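The majority rule and the first caveat can be sketched in a few lines. The names Quorum and hasMajority are hypothetical, and expectedSize is assumed to be a cluster-wide setting that an operator lowers before spinning servers down on purpose:

```java
// Hypothetical helper; a sketch of the majority rule described above.
class Quorum {
    // A partition may keep processing only if it can reach a strict
    // majority of the nodes the cluster is expected to have.
    static boolean hasMajority(int reachableNodes, int expectedSize) {
        return reachableNodes > expectedSize / 2;
    }
}
```

With 5 expected nodes, the side that sees 2 stops and the side that sees 3 continues. With an even split (2 of 4), neither side has a majority, so both stop. After scaling from 15 nodes back down to 5, the survivors only keep working once expectedSize has been lowered to match.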

Scalability
I’m targeting 100 nodes as a practical maximum.  In production, we’ve used up to 5 nodes.  I don’t expect us to go above 10 in the near future, so I’m adding an order-of-magnitude safety margin on top of that.  Of course, when the task management framework goes open source, 100 nodes will be quite feasible for users with different problem domains and deployment patterns.  In our typical deployment scenarios, customers prefer fewer, larger servers.

EC2
 The clustering option has to be able to run on EC2 instances.  It can’t depend on IP multicast.

Deployment simplicity
  As few moving parts as possible, to make integrating and deploying the software as simple as possible. Our app is often deployed by the client on their hardware, so we’d like to make their lives as easy as possible. And as the framework goes open source, of course, it’s going to be more helpful for more people if it has a low barrier to entry.

Implementation simplicity
 The less code I write, the fewer bugs I introduce.

Event bus
 Nodes can also pass events to other nodes, typically by multicast.  We use JMS for this currently, and we can continue to do so, but it’d be nice to get rid of that particular moving part if possible.

There is not necessarily a requirement to persist all tasks to disk in order to keep them safe between server restarts. In our previous version, tasks were kept in a JMS queue, and we ended up turning disk persistence off. In practice, a failure that requires manual intervention (for example, a JMS broker failure, or the failure of all nodes that had tasks in memory) almost always means that we want to manually restart jobs after the system comes back up – possibly different jobs, possibly a reduced set of tasks to make up time. We found that we rarely want to start back up exactly where we left off. So, if the solution automatically persists to disk, I’ll probably create a way to bypass that if necessary (a command-line flag that lets us clear the current tasks on startup, perhaps).

Zookeeper

Consistency
Zookeeper pushes everything to disk – writes are always persisted, for safety. Zookeeper is designed for cluster coordination, and it’s well-thought-out. It has documented consistency guarantees, lots of documented recipes for things like leader election, and supports subscription of listeners on tree changes.

Elasticity
We’d have to pick three servers up front that would serve as Zookeeper nodes as well as normal nodes. Those nodes wouldn’t be elastic, but other nodes could come up and down without a problem.

Partition-tolerance
The “live” partition would be defined as “the partition that can still connect with the Zookeeper cluster”. I have to check on how Zookeeper handles network partitions; I believe it’s just a quorum-based algorithm (if one node is separated from the other two, it stops processing).

Scalability
Three Zookeeper nodes could support 100 nodes without a problem, according to anecdotal experience. At that point, we would have the option of moving the Zookeeper nodes onto dedicated hardware. Having that option is the upside to Zookeeper’s deployment complexity: you can separate them, if there’s a need.

EC2-friendliness
Zookeeper uses TCP; it’s been used on EC2 before.

Deployment Simplicity
The main downside for Zookeeper is its operational requirements: two classes of servers (with Zookeeper and without), potentially separate server instances running on Zookeeper-enabled servers, fast disk access required (according to the ZK operations manual, two fast disks – one for logs, the other for… er… something else). That’s a significant increase in operational complexity for a product that is distributed to and maintained by clients. That also means work necessary for us to make it as turnkey as possible.

A Zookeeper-based solution would pick three or five nodes to be Zookeeper nodes as well as task nodes. Ideally, for Zookeeper’s purposes, it’d be three dedicated servers, but that isn’t going to happen in our architecture. So Zookeeper will have to coexist with the current app on servers where it’s installed. For deployment simplicity, I’ll probably need to come up with a way to start up Zookeeper automatically when the server comes up on those nodes. Zookeeper used to not play well as an embedded app (according to some reports from the Katta project); that may be fixed now, but if not I may need to put that logic in the bootstrap script.

Implementation Simplicity
Zookeeper would provide a single, shared view of the task tree. The system would be up as long as the majority of Zookeeper nodes remained up; any node could be leader, as long as it could connect to Zookeeper to view and update the shared state of tasks. Other nodes could potentially go straight to Zookeeper to query the task tree, if we were comfortable with losing that layer of abstraction. Either way, it would be simple implementation-wise.

Event bus
Zookeeper doesn’t provide events… unless you write events directly to Zookeeper. It could, however, obviate the need for a lot of events. For example, we wouldn’t need a “task finished” event if all nodes just set watches on tasks… they’d be automatically notified. Same with heartbeats, node failure, and so on – they’d be taken care of by Zookeeper.

Zookeeper doesn’t list AIX as a supported environment, which is interesting. We do have to support AIX (unfortunately).

JGroups

Consistency
JGroups is a lot more of a DIY solution, although it supports guaranteed message delivery and cluster management out-of-the-box. With guaranteed message delivery, we could have consistent in-memory views of the task tree without too much effort; we wouldn’t need to write something that also persisted to disk.

Elasticity
I have to look into how elastic JGroups can be without IP multicast. I know it can work without IP multicast; I just don’t know exactly HOW it works.

Updated 01 March 2011: JGroups has a discovery method called TCPPING, in which you predefine a set of known members that new members will contact for membership lists. It also has a standalone “gossip router” that serves the same purpose. I’ve played with this; it works on EC2 without problems.

Partition-tolerance
There are options. JGroups has a merge protocol, but it does require application-specific code to determine what exactly to do when the partitions come back together.

Scalability
More speculative, because JGroups has the option for infinite combinations of protocols on its stack. Worst case, the cluster needs to define two classes of nodes, similar to the Zookeeper implementation: “nodes which can become leaders” and “nodes which can’t.” But JGroups has been deployed with hundreds of nodes.

EC2-friendliness
JGroups can be used on EC2 as long as you use a non-multicast discovery method.

Deployment Simplicity
JGroups would be baked in; worst case, you need to define a couple of properties per server (the IP of one or more nodes to connect to, and perhaps something like “is this a temporary node”).

Implementation Simplicity
More DIY. The basic protocol stack is provided, along with clustering, leader election, “state transfer” (to bootstrap a new node with a current picture of the task tree) and the lower-level plumbing; we’d just have to tie it together and fill in the gaps for application-specific needs.

Event bus
JGroups can also give us message passing, taking the place of a simple JMS topic. I have a lot of questions here, though: how does this handle nodes dropping? Does it buffer events until the node comes back on? If so, how do we handle nodes that permanently drop?

Conclusion

I have a followup post about Hazelcast here, and a brief conclusion here.

On Configuration

I’ve been musing recently on how to scale our configuration system up and out: making it work more dynamically for single nodes, and handle multiple nodes…well, at all.

How it works now

Right now, our configuration is spread across the following places, roughly in order of precedence:
  • System properties (-D properties set on the command line)
  • a user-editable property file, for overrides to default properties.  This is preserved during system upgrades.
  • a “system defaults” property file, user-visible but overwritten during upgrades.
  • a database table:  for a while, we were standardizing on properties in the database.  More on this later…
  • some property files buried inside jars, which hold some database queries, some default values, and so on.
  • log4j.xml
We use a MergedPropertyPlaceholderConfigurer (along with a couple of other configuration classes available on github) that merges properties from most of the above locations together, ranks them in order of precedence, and then sets them at startup time using Spring’s standard placeholder syntax (${property.name}). Database properties are loaded from a table using a special Spring property loader. So any property can be set in a higher-precedence location and it will override one set in a lower-precedence location.

In practice, new properties tend to get set in the property file. Why? Because database changes require a patch (like a migration in the Rails world) which needs to get migrated to each applicable environment. Deploying the code to a development or test server then requires both a code update and a database update. In practice, the dependency between the code and particular database patches is a bit of a hassle – certainly far more so than just adding it to a property file which gets deployed along with the code. A bad motivation for keeping properties in files? Perhaps… but it is the reality of it. A system that raises the barrier of entry for doing “the right thing” is a bad system. Which brings us to…
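The precedence merge itself is simple to picture. The following is not the MergedPropertyPlaceholderConfigurer code, only a plain-Java sketch of the same idea with a hypothetical helper named PropertyMerge: apply the sources from lowest to highest precedence, so later sources overwrite earlier ones.

```java
import java.util.Properties;

// Hypothetical helper; a sketch of precedence-based merging.
class PropertyMerge {
    // Sources are passed lowest precedence first; each later source
    // overwrites any keys already set by an earlier one.
    static Properties merge(Properties... lowestPrecedenceFirst) {
        Properties merged = new Properties();
        for (Properties source : lowestPrecedenceFirst) {
            merged.putAll(source);
        }
        return merged;
    }
}
```

A call such as merge(jarDefaults, database, systemDefaults, userOverrides, System.getProperties()) would follow the list above read from bottom to top, assuming that list really is in strict precedence order.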

Problems with current system

  1. property files are cumbersome in a distributed environment.  Many of our deployments are single-node, but more and more they’re distributed, and distributed deployments should be our default going forward.
  2. For properties stored in the DB, adding, removing or updating any property requires a DB task, and then a DB refresh, which has the effect of discouraging parameterizing things. You tend to think, “eh, well… I’ll just hard-code this for now…”
  3. Properties are loaded at startup time only – you can’t change property values without changing each property file and then restarting each node.

Requirements for a new system:

I’d like to borrow requirements from http://jim-mcbeath.blogspot.com/2010/01/reload-that-config-file.html:
  1. Reloading a configuration should be a simple operation for the operator to trigger.
  2. It should not be possible to load an invalid configuration. If the operator tries to do so, the application should continue running with the old configuration.
  3. When reloading a configuration, the application should smoothly switch from the old configuration to the new configuration, ensuring that it is always operating with a consistent configuration. More precisely, an operational sequence that requires a consistent set of configuration parameters for the entire sequence should complete its sequence with the same set of configuration parameters as were active when the sequence started. – For us, this is actually pretty easy.  Our app depends on a task distribution framework, meaning that work is defined as a series of tasks with defined beginnings and endings.  So, we merely need to load the configuration at the beginning of each discrete unit of work.
  4. The application should provide feedback so that the operator knows what the application is doing. Logging, notification or statistics about configuration reloads should be available.

    …and I’d add:
  5. We should be able to set configurations for all nodes at once (this could mean using the database, or perhaps a command-line tool that sprays configurations out to the various nodes, plus a web service to tell nodes to reload..or something else entirely).
  6. We should be able to view the current configuration for each node easily.
  7. We should be able to share configuration between our app and other related applications. Again, this could be the database, or a web service that exposes our properties to other applications.
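Requirements 2 and 3 suggest a particular shape: validate a candidate configuration before swapping it in, and hand each unit of work an immutable snapshot when it starts. A minimal sketch follows, assuming string-valued properties and a caller-supplied validity check; the class name ReloadableConfig and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Hypothetical holder for the active configuration; illustrative only.
class ReloadableConfig {
    private final AtomicReference<Map<String, String>> current;
    private final Predicate<Map<String, String>> validator;

    ReloadableConfig(Map<String, String> initial, Predicate<Map<String, String>> validator) {
        this.validator = validator;
        this.current = new AtomicReference<>(Collections.unmodifiableMap(new HashMap<>(initial)));
    }

    // Swaps in the candidate only if it passes validation; otherwise the
    // old configuration stays active and false is returned.
    boolean reload(Map<String, String> candidate) {
        if (!validator.test(candidate)) {
            return false;
        }
        current.set(Collections.unmodifiableMap(new HashMap<>(candidate)));
        return true;
    }

    // A task calls this once when it starts and uses the returned map for
    // its whole run, so a reload in mid-task never changes what it sees.
    Map<String, String> snapshot() {
        return current.get();
    }
}
```

The return value of reload is also a natural place to hang the logging and operator feedback asked for in requirement 4.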

Current thoughts

At the code level, I’m thinking of loading properties at the beginning of each task, using a base class or something built into the framework. Reloading and interrogating the configuration could be via a web service (get_configuration / set_configuration). For requirement 3, the easiest option seems to be to use Configgy as a configuration base. As far as centralized configuration goes, I’m up in the air. Some options:
  • Spraying config files (scp’ing configuration files to each server, which would have to be tied to either an automatic poll of files, or a manual “reload_configuration” web service call)
  • distributing configuration using a web service (node 2 calls get_all_configuration on node 1, and sets its own configuration accordingly) – but it would need to be saved somewhere in case node 2 restarts when node 1 isn’t available. The database is an option, but has development-time issues as noted above.
  • saving all configuration in Zookeeper.
What I’d really like, though, is a configuration system that keeps properties in an immutable data structure and keeps track of where properties came from. I could define the locations properties should come from, and then in the application I could say “config.getProperty(‘foo’)” and get the value with the highest precedence (whether that’s from an override file, a database table, or whatever). But I could also say “config.getPropertyDetails(‘foo’)” and get a list that said “property ‘foo’ is set to ‘bar’ by local override, is set to ‘groo’ by the central configuration server, and falls back to ‘moo’ as a default.” Now, why do I want that? Mainly for on-site debugging: “I set the property in this property file, but it’s not working!”
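A sketch of what that could look like, using the two method names from the paragraph above. Everything else (the class name TracedConfig, the withSource method, the string format of the details) is invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical provenance-tracking configuration; illustrative only.
class TracedConfig {
    // Source name -> properties, kept in order of precedence (highest first).
    private final Map<String, Map<String, String>> sources;

    TracedConfig() {
        this(new LinkedHashMap<>());
    }

    private TracedConfig(Map<String, Map<String, String>> sources) {
        this.sources = sources;
    }

    // Returns a new config with one more (lower-precedence) source;
    // the existing instance is never modified.
    TracedConfig withSource(String name, Map<String, String> props) {
        Map<String, Map<String, String>> copy = new LinkedHashMap<>(sources);
        copy.put(name, new HashMap<>(props));
        return new TracedConfig(copy);
    }

    // The value from the highest-precedence source that defines the key.
    String getProperty(String key) {
        for (Map<String, String> props : sources.values()) {
            String value = props.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Every source that defines the key, highest precedence first,
    // formatted as "source name=value".
    List<String> getPropertyDetails(String key) {
        List<String> details = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : sources.entrySet()) {
            String value = entry.getValue().get(key);
            if (value != null) {
                details.add(entry.getKey() + "=" + value);
            }
        }
        return details;
    }
}
```

For the on-site debugging case, the details list answers the question directly: if the value set in a property file shows up second in the list, something with higher precedence is overriding it.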

Some related (external) links:

I’m open to ideas, as well… Anyone have best-practices for distributed configuration?

Oh, Oracle

So, I was responsible for a pretty unfortunate bug today — no way around it, I messed up. It was classic — there was a “TODO” block where I meant to come back and finish some code, and no doubt got distracted by some very valid crisis.

Fortunately, it was caught before it affected production data, but it was in test, and visible, and it was scary that it had gotten that far.

But I couldn’t help but be bitter about how that block of code came to be in the first place: the code that contained the bug was part of an elaborate scheme designed to work around joining to a particularly large table in certain circumstances.

Now, it’s big data (well, at least tens of millions of rows)… we have to do what we can for efficiency. But it made me slightly bitter that the more we optimize for relational databases, trying to eke out more and more performance, the farther we move from a clear data model, and the less we’re using the “R” in RDBMS. We go through contortions, and in the process introduce bugs.

I’m not sure who that might be a lesson for, but perhaps if anyone is dead-set on using a tried-and-true RDBMS to avoid bugs in newer systems, whether MapReduce or a “NoSQL” store, it’s worth noting that there is some tradeoff here: bugs in the data store in question vs. bugs you introduce into your own code due to the increased complexity as you contort your data model to make it scale where you need it to go.