Monday, April 26, 2010

WC-NG Changes

In my last post, I described how libsvn_wc had become brittle and hard to manage. The WC-NG process is working to solve that problem, and though we're not yet done, I believe we're on the right path.

The basic question that needed answering is, "where did we go wrong?" While version control is a hard problem (especially if you version directories!), it does not inherently lead to a brittle library. Somewhere, we had gone wrong in the design, the data model, or simply the implementation.

Before I started working on the problem almost two years ago, one of the Subversion developers (Erik Hülsmann, I believe) laid out his thoughts for a next-generation library. In those notes, he postulated what I now call the Three-Tree Model:
  • the tree you checked out
  • the above tree, plus structural changes (add, delete, move, copy)
  • the above tree, plus content changes (file edits, property edits)
Any working copy operation generally affects one of these trees. svn update and svn switch work on the first tree. svn add and svn merge modify the second tree. Your editor and svn propset affect the last tree.
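
To make the layering concrete, here is a toy sketch in Python (not Subversion's C code, and not the actual WC-NG API). The class name `ThreeTrees`, its fields, and its methods are all hypothetical, and files are reduced to a path-to-content mapping. It only aims to show that each tree is derived from the one before it, and that the checked-out tree stays untouched by local changes.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ThreeTrees:
    """Toy model of the three trees; all names here are hypothetical."""

    # Tree 1: what was checked out (path -> content).
    base: Dict[str, str] = field(default_factory=dict)
    # Structural changes layered on tree 1 to form tree 2.
    deleted: Set[str] = field(default_factory=set)
    added: Dict[str, str] = field(default_factory=dict)
    # Content changes layered on tree 2 to form tree 3.
    edits: Dict[str, str] = field(default_factory=dict)

    def checked_out_tree(self) -> Dict[str, str]:
        # Local operations never alter this tree.
        return dict(self.base)

    def restructured_tree(self) -> Dict[str, str]:
        # Tree 1 minus deletes, plus adds/copies.
        tree = {p: c for p, c in self.base.items() if p not in self.deleted}
        tree.update(self.added)
        return tree

    def edited_tree(self) -> Dict[str, str]:
        # Tree 2 with content edits applied to nodes that still exist.
        tree = self.restructured_tree()
        for path, content in self.edits.items():
            if path in tree:
                tree[path] = content
        return tree


wc = ThreeTrees(base={"a.txt": "one", "b.txt": "two"})
wc.deleted.add("b.txt")      # structural change
wc.added["c.txt"] = "three"  # structural change
wc.edits["a.txt"] = "ONE"    # content change
```

In this sketch, asking "which tree does this value belong to?" is answered by which field or method you used, rather than by inspecting flags.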

This was the key insight. In our "wc-1" implementation, the svn_wc_entry_t structure blended all three trees together. A change to that structure could be operating on any of the three trees, depending on its flags. Its checksum field could correspond to a checked-out file or to a locally-copied file. To determine which, you had to look at the schedule field and the copied field. And hell will rain upon you should you mess up the flags or forget to check one.

For WC-NG, we have built a new data storage system with an API designed around this three-tree model. This has isolated our storage mechanism behind a solid encapsulation (wc-1 code had too much knowledge of the old "entries" storage model). Operations are now understandable: "copy nodes in the restructuring tree" instead of "set entry->schedule".

This new storage subsystem could produce an entire post on its own. It is radically different from the prior model: the new system uses a single .svn subdir at the root of the working copy, and SQLite-based storage. This is causing huge challenges in upgrades/migrations to the new format, and in backwards compatibility for our classic APIs.

Another radical change was our move to using absolute paths to refer to items. The prior model used an "access baton" which implied a relative directory, along with a path relative to that baton. These relative batons and paths caused enormous problems because they led to the question, "relative to what?" In most cases, the answer was "the operating system's current working directory," which is a terrible basis for a deterministic API. Switching to absolute paths rendered the access batons obsolete. Since they were a core part of the public API for libsvn_wc (not to mention the widespread internal changes!), this has had a huge impact on the API and its users (such as Subversion's libsvn_client library and its command-line tools).
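
The "relative to what?" problem can be illustrated with a small Python sketch. This is not libsvn_wc code: the helper names `resolve` and `require_absolute` and the example paths are hypothetical. It assumes POSIX-style paths and passes the working directory explicitly to stand in for process state.

```python
import posixpath


def resolve(path: str, cwd: str) -> str:
    """Resolve `path` as a process whose working directory is `cwd` would."""
    # posixpath.join ignores `cwd` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(cwd, path))


def require_absolute(path: str) -> str:
    """Hypothetical API guard: accept only absolute paths."""
    if not posixpath.isabs(path):
        raise ValueError("expected an absolute path, got %r" % path)
    return posixpath.normpath(path)


# The same relative path names two different files, depending on hidden state.
from_wc = resolve("src/main.c", cwd="/home/alice/wc")
from_tmp = resolve("src/main.c", cwd="/tmp")

# An absolute path names the same file no matter where the process is.
abs_from_wc = resolve("/home/alice/wc/src/main.c", cwd="/home/alice/wc")
abs_from_tmp = resolve("/home/alice/wc/src/main.c", cwd="/tmp")
```

A guard like `require_absolute` at the API boundary is one way to keep the ambiguity from creeping back in, since a relative path is rejected immediately instead of being silently resolved against whatever directory the process happens to be in.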

These two items (data model and absolute paths) are the core changes in WC-NG. The ripple effect from just these two items is immense. We will need to rewrite almost every one of the 40,000 lines of code in the library. And given our incremental approach, many of those will be changed multiple times. We're a solid year into this (although we saw downtime last fall due to our move to the Apache Software Foundation), and we probably have another several months of basic grunt work ahead of us. Stabilization and testing will put our 1.7 release into late summer or possibly this fall.

I could really go on and on about this stuff, but I hope this post provides some basic background on the WC-NG efforts. Please feel free to post any questions (I have no idea what aspects you may want to hear more about!), and I'll work on answering them.

5 comments:

Wim Coenen said...

I'm really looking forward to the centralized .svn metadata. Every time I introduce a new programmer to SVN, it seems the first thing they do is thoroughly mess up their working copy by moving versioned folders around in windows explorer...

Ben Collins-Sussman said...

I'm confused; why are you guys spending energy on "upgrading/migrating" existing wc's to the new format? When it comes to repository format changes, I totally get it. But I thought wc's were cheap, shallow workspaces -- just throw them away and checkout a new workspace. Upgrading a wc in place seems like a huge waste of time to me.

djcbecroft said...

With regards to the three trees, wouldn't an 'svn merge' modify both the second and third trees, not just the second? Merge changes both the structure, and the file content. Or did I miss something?

Greg Stein said...

Ben: most users are not system administrators. They come in to work and find they're using an upgraded Subversion (and no longer have access to the previous release). And sysadmins don't know "all" the working copies created by the users, let alone whether any of those working copies have pending changes.

We could say, "commit all work before upgrading", or "users should construct patches from their old working copies, to apply to new working copies", but I believe both of those are unworkable.

There are millions of svn users... I think we can spend the effort to ease their migration.

Greg Stein said...

@djcbecroft: oops! yes, you're correct. I was mostly trying to get across that it doesn't change the "checked out" tree whatsoever (which was very hard to reason about from the old data structures).

And on this, I guess that I should also correct my statement about "svn update". While it targets the tree that you've checked out, it can alter the latest edits ("eliminate" some because they matched what was on the server, or insert conflict markers). Merging rules specify what will happen to text/props. It is also possible for an update to affect the restructured tree, but that will usually leave a tree conflict marker ("you added a file, and so did the server; I'll change your add into a content edit of the existing file").