Explaining the Recent Hackage Downtime

Posted on April 26, 2018 by Gershom Bazerman

Roughly two weeks ago, on April 12, we had about a day of Hackage downtime — the most significant downtime Hackage has experienced in years. This was not due to any issues in the Hackage codebase, but rather user error — mine specifically. I had been performing administrative maintenance on our Rackspace virtual boxes, trying to shut down those that are no longer needed and maintained. This is important so we can have an accurate inventory of the services we use, in preparation for the eventual termination of Rackspace’s donations of infrastructure to FLOSS projects. Due to a cross-wiring of naming conventions, I confused the main Hackage box with one of our decommissioned docbuilders. There were a number of things I could and should have checked before proceeding to delete it, and which I had in the past been conscientious about. Needless to say, on April 12, I was not conscientious, did not check the things I should have, and so deleted the box.

Our monitoring service notified us within a minute that Hackage was down, and so I immediately realized the error I had made. Unfortunately, once a box is gone in Rackspace, it is gone immediately, and so I was unable to restore anything after calling their support. This left us with the task of rebuilding Hackage. We had recent (two-week-old) full backups, and the configuration of the box is not hard to reproduce. However, there was a huge problem: all activity on Hackage over those past two weeks — uploads, revisions, etc. — would be gone. Over that span, there were roughly 1,000 such actions. Furthermore, even if we were to decide that losing all that data was ok (it isn’t), hackage-security would accurately detect the loss of this data as a potential rollback, and force a redownload of the proper, signed 01-index.tar. But this in turn would cause another problem, because all but the latest versions of cabal don’t handle re-initializations of index files following such resets properly.

Restoring the Service

The right thing to do, to avoid the loss of data (and attendant other problems) for end-users, was to restore all the missing data possible — and certainly enough to reconstruct exactly the last state of the 01-index.tar file. The data source we could accomplish this with was our mirrored Hackage servers (which continued to function), which provided the uploaded package tarballs and, equally importantly, the 01-index.tar file itself. In the hackage-security architecture, this tarball, which used to hold just a bundle of cabal files, now contains a wealth of structured information — the tar entries carry proper upload dates, revision information, and the userid and name of uploaders. Further, there are special entries that indicate when changes are made to preferred and deprecated package versions, as these affect solver plans as well. Herbert initially suggested how we could make use of this data to fix things up, and I spent the next day working to put back together the data I had destroyed. The result was a relatively short script that replays the actions recorded in an index tarball against a hackage db store.
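To give a concrete sense of the structured data we were working from, here is a minimal sketch (not the actual recovery script) that walks a local copy of 01-index.tar with the tar package and prints the timestamp, the recorded uploader, and a rough classification for each entry. The local file path and the classification rules are assumptions for illustration only.

```haskell
-- Minimal sketch (not the recovery script itself): walk a local copy of
-- 01-index.tar and print the structured metadata carried by each entry.
-- Assumes the "tar", "bytestring", "time", and "filepath" packages; the
-- path below is hypothetical.
module Main where

import qualified Codec.Archive.Tar as Tar
import qualified Codec.Archive.Tar.Entry as Tar
import qualified Data.ByteString.Lazy as BL
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)
import System.FilePath (takeFileName)

main :: IO ()
main = do
  index <- BL.readFile "01-index.tar"   -- hypothetical local path
  let entries = Tar.foldEntries (:) [] (error . show) (Tar.read index)
  mapM_ describe entries

describe :: Tar.Entry -> IO ()
describe entry =
  putStrLn $ unwords
    [ show (posixSecondsToUTCTime (fromIntegral (Tar.entryTime entry)))
    , Tar.ownerName (Tar.entryOwnership entry)   -- uploader name recorded in the tar header
    , kind (Tar.entryPath entry)
    , Tar.entryPath entry
    ]
  where
    -- Rough classification of the kinds of entries found in the index:
    kind path
      | takeFileName path == "preferred-versions" = "[preferred/deprecated versions]"
      | takeFileName path == "package.json"       = "[hackage-security target metadata]"
      | otherwise                                 = "[cabal file upload or revision]"
```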

Written in the midst of a recovery effort, the script requires more work to be polished and merged into the hackage-server mainline. Even so, this incident illustrates some of the strengths of the overall architecture. Most importantly, cabal-install is able to fall back to mirrors relatively seamlessly (and securely!). As such, the core usage of Hackage as a package repository for automated tooling was not affected (though doc browsing, discovery, and many other things were). In fact, with a sufficiently new cabal-install, the server isn’t even necessary to bootstrap the mirror list, as that information is conveyed directly through DNS metadata. Furthermore, the mirroring meant that all core information about the repository was preserved at multiple sites, so we were able to reconstruct the important actions — and because of the incremental, timestamped nature of the index tarball, we were able to reconstruct those actions quite precisely. In terms of Hackage itself, programmatic access to the raw acid-state store was a boon here, allowing code to cut through layers of abstraction and manipulate pure Haskell structures to get all the data into the right shape. Most amazingly — a testament to the codebase involved and to the Haskell language itself — once the script type-checked and its logic had been tested on a tiny subset of the full Hackage data, we set it loose on the real data and it produced a bit-for-bit correct 01-index tarball on the first try!
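Hackage’s state lives in acid-state, which persists ordinary Haskell values and logs updates as serialized events that are replayed on startup; that is what made it feasible to manipulate the data directly from a script. The following is a minimal, self-contained sketch of that general pattern, with a made-up PackageNames state type standing in for hackage-server’s real (much richer) schema.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeFamilies #-}
-- Minimal acid-state sketch of the general pattern hackage-server uses:
-- state is a plain Haskell value, and updates are pure functions that are
-- logged and replayed. "PackageNames" is a made-up stand-in, not the real
-- hackage-server schema.
module Main where

import Control.Monad.Reader (ask)
import Control.Monad.State (modify)
import Data.Acid
import Data.SafeCopy
import Data.Typeable (Typeable)

newtype PackageNames = PackageNames [String]
  deriving (Show, Typeable)

$(deriveSafeCopy 0 'base ''PackageNames)

addPackage :: String -> Update PackageNames ()
addPackage name = modify (\(PackageNames ns) -> PackageNames (name : ns))

allPackages :: Query PackageNames [String]
allPackages = do
  PackageNames ns <- ask
  return ns

$(makeAcidic ''PackageNames ['addPackage, 'allPackages])

main :: IO ()
main = do
  -- Open (or create) an on-disk store under ./state/ and replay its log.
  st <- openLocalStateFrom "state" (PackageNames [])
  update st (AddPackage "acme-everything")
  names <- query st AllPackages
  print names
  closeAcidState st
```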

I’m also happy that we learned enough from past experience to communicate the problem widely, across the right channels, in a timely fashion. It should also be noted that while I banged out the code, Herbert coordinated restoring the backups and did much of the work on the new server setup, Duncan provided invaluable advice, and various other members of the admin apparatus helped with other pieces, including advising on mail setup, nginx config, and so forth. Putting all the pieces back together again was a collective effort, and I’m massively appreciative of everyone who helped remedy the mess I created.

Improving for the Future

There are also lessons about what we could do better. First is better tracking and labeling of which boxes are what in our server administration (which, ironically, is what I was trying to do when I made my awful blunder). In the future, I think we should establish a policy that all provisioned boxes must be labeled with the responsible party, the purpose, and the date. In terms of server architecture, there are some things we didn’t restore programmatically but had to do manually — in particular, changes to maintainer groups. Many of these could have been inferred by the script, but it didn’t do so. Other information couldn’t have been inferred at all — additions to the uploader group, maintainer-group changes with no subsequent upload (uploads being what we would use to infer such changes), and so on. Further, we didn’t have mirroring of the doc tarballs, so while the docbuilder rebuilt everything it could, some manually uploaded docs were likely lost. Whole-package deprecations (as opposed to version deprecations) are also not in the tarball, so if any occurred over the affected span, we lost those.

For a really resilient system, we need a combination of full snapshot backups and comprehensive mirroring. This suggests some changes and improvements throughout the Hackage ecosystem. One place we could augment is hackage-mirror-tool, which allows mirroring of Hackage to static filesystem repositories. In particular, it should also be able to mirror doc tarballs, and perhaps optionally expand them, to allow online browsing of mirrored docs. Further, it would be nice if it could generate some very skeletal static HTML to allow rudimentary browsing of packages as well. Another thing we could do is provide, alongside the 01-index tarball in hackage-server, a second tarball giving a similar compressed chronological store, not of package metadata but of other public server metadata — accounts (though not passwords), maintainer groups, whole-package deprecation information, and so on. We could then teach the mirror tool to mirror those as well, and the replay tool how to make use of them.
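For a sense of how small the skeletal-HTML piece could be, here is a hedged sketch that scans a directory of mirrored package tarballs and writes a bare index.html linking to each one. The directory layout (a flat package/ directory of .tar.gz files) is an assumption for illustration, not the actual hackage-mirror-tool layout.

```haskell
-- Hedged sketch of the "skeletal static HTML" idea: list mirrored package
-- tarballs and emit a bare-bones index.html linking to them. The assumed
-- layout (a flat "package" directory of .tar.gz files) is illustrative only.
module Main where

import Data.List (isSuffixOf, sort)
import System.Directory (listDirectory)

main :: IO ()
main = do
  files <- listDirectory "package"                      -- assumed mirror directory
  let tarballs = sort (filter (".tar.gz" `isSuffixOf`) files)
      row name = "<li><a href=\"package/" ++ name ++ "\">" ++ name ++ "</a></li>"
      html = unlines $
        [ "<html><body><h1>Mirrored packages</h1><ul>" ]
        ++ map row tarballs
        ++ [ "</ul></body></html>" ]
  writeFile "index.html" html
```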

Also, the “replay” approach seems mainly superior to the existing MirrorClient approach for mirroring packages to a full hackage-server deploy. The downside is that the replay approach requires the server not be running, while MirrorClient mirrors to a running server. Perhaps there’s some way to improve the replay client, allowing it to work in a “mirror upload” mode as well as a “full replay” mode, and to simplify and improve the mirroring machinery in hackage-server proper as well. These are all interesting and important problems to work on, and the hackage-server and hackage-mirror-tool codebases would be very welcoming to new contributors who wanted to look into them. It would also be very nice for redundancy if someone were to volunteer to set up a further mirror of Hackage, just using the tooling as it already exists.

Finally, this points to a social rather than a technical weakness in our current setup. I think I made the error I did because I’ve taken on, on the whole, too much. Our admin apparatus is stretched thin with many responsibilities (with various key players no longer having the time to contribute that they once had), and rather than proceeding cautiously I let the combination of a felt sense of urgency and isolation drive me to substitute myself too much. And when one does that, one makes errors, sometimes really bad ones. Past calls for help have yielded people stepping up to help with individual elements, which are proceeding slowly — for example, winding down community.haskell.org and archiving it, or improving the wiki. But with the impending need to actually migrate from Rackspace (which first means securing hosting elsewhere, something we’ve only just started to investigate — leads welcome!), we will need some concerted and coordinated effort, and a few more experienced sysadmins willing to devote some time over a protracted span and help alleviate the burden of overall responsibility would be a great help. People who might be interested in this should drop a line to admin [at] haskell.org.