Git at Magnolia – now a reality !

Back in December (I had to look it up), we decided to switch Magnolia’s codebase to Git. As far as I can recall, two main factors drove our decision:

  • popularity. That might seem shallow, but when thinking about ways to attract contribution and participation, it really makes sense. If a tool makes it easier to contribute, for newcomers and seasoned Magnolians alike, then we should go for that tool. Git is that tool, and GitHub’s success is very likely a big part of why many developers are now familiar with the concept of a “pull request”.
  • branching. Veterans (gosh I sound old) in the core team have been scarred by, and scared of, branches in Subversion for quite a while now. We hardly ever branched for anything other than maintaining older releases, and that made “innovation” difficult, for a ton of reasons that I suppose are fairly obvious.

With that in mind, we started thinking about when and how to migrate. We had a couple of big releases coming up, and thought we’d wait until those were done. We took that opportunity to schedule a couple of training sessions with Matthew McCullough, after I’d had the chance to attend one of his intense and packed one-day trainings in Basel.

Migrating a project like Magnolia was no small feat. Our Subversion repository has close to 8 years of history, with its first commit dating back to the 4th of November 2004 ! (and its commit message always makes me cringe: “Importing Magnolia 2.0 beta src of 21.Oct.2004” :D).

In addition to the long history, the main project has about a hundred branches and tags.

And in addition to that, we have 60+ public projects (community and enterprise edition modules), as well as countless private/sandbox ones. In the Subversion world, we only ever used a single central repository. We knew that in the Git world, we’d have to split that into one repository per “project”, and so we had to take into account the additional challenge that most of the codebase had been moved around the repository a couple of times across its long history.

We knew we had several options for migration, and we actually considered “dropping” the history, but we thought we’d take a shot at keeping it, and see what that looked like.

We also decided on some sort of “progressive” migration, where we’d first migrate the main project, then community modules, the enterprise stuff, and lastly internal and sandbox projects.

I started testing git-svn with small modules, and knew immediately that I’d better script the hell out of that, because I didn’t want to have to remember every option to use for the hundreds of projects and modules, some of which I knew wouldn’t be migrated until months later.
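To give an idea of what “scripting it” means, here is a minimal sketch of a wrapper around git-svn. It is not the published script: the function name, the repository URL and the authors file are hypothetical, and it assumes each module uses the standard trunk/branches/tags layout. It only prints the command, so the options for hundreds of modules can be reviewed before anything runs.

```shell
#!/usr/bin/env bash
# Sketch only: build the git-svn command for one module instead of typing it
# by hand. All names and paths below are hypothetical.

# Usage: migration_command <svn-root-url> <module> <authors-file>
# Prints the command that would clone <module> with its full history.
# Assumes none of the arguments contain whitespace.
migration_command() {
  local svn_root="$1"
  local module="$2"
  local authors_file="$3"
  # --stdlayout:    the module has trunk/, branches/ and tags/ directories
  # --authors-file: maps SVN user names to "Name <email>" Git identities
  printf 'git svn clone --stdlayout --authors-file=%s %s/%s %s\n' \
    "$authors_file" "$svn_root" "$module" "$module"
}
```

Calling `migration_command file:///path/to/svn-copy magnolia-module-dms authors.txt` prints one `git svn clone` line; piping the output for a list of modules into a file gives a reviewable migration plan.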

What I wanted out of such scripts was not only the actual migration of the repository, but also some cleanup procedures (to make the results more Git-like) and, most importantly, an automated 1:1 comparison of each and every SVN tag and branch with its newly migrated Git counterpart.

If you have a local “copy” of your Subversion repository, migrating modules like data or dms turns out to be rather fast: in a matter of minutes, you can have magnolia-module-dms turned into a usable Git repository. But I quickly realized that there are also a couple of obstacles that are pretty typical when you migrate a Subversion repository to Git:

  • keyword expansion. This simply does not exist in Git, and I’m not going to whine about it. It’s always been a mess in SVN, especially since most of the time we didn’t remember (or know) that we needed to enable expansion via a property on each and every file or directory. But the verification script I mentioned earlier had to take that into account. Lots of find, grep and sed later, I had a working solution.

  • ignores. Actually, this is a plus. svn:ignore properties are scattered all across your project, and there’s no simple way to see what they contain (find . -type d -exec svn pg svn:ignore {} \; is the “best” I can think of). In fact, git-svn provides a tool to help recreate them, and it works reasonably well, but it took a while before I “got it”: one needs to recreate them for each and every branch, and for a while, my scripts weren’t handling branches correctly.

  • empty directories. In the case of Magnolia, I realized we had a bunch of them in our Subversion repository, but more as the result of accidents than anything else. Most of them are useless; the few that are useful are only useful during development, and that’s probably not even entirely accurate, but this wasn’t the time to investigate that and/or to fix code that would depend on such a directory existing. In the Git world, the usual workaround to keep empty directories in the repository is to store a placeholder file, typically a .gitignore, or, perhaps a better idea, a README.txt file explaining the purpose of the directory.
    However, we realized that only the main project had such possibly-relevant empty directories. For all other migrations, we decided not to use git-svn’s --preserve-empty-dirs flag. Turns out we did not use it for the main project either in the end: after crashing several times and running for almost 3 weeks, the first test migration of the main project finished, and the results looked decent, but either --preserve-empty-dirs was buggy or I completely misused it, because most branches had completely wrong “empty” directories. I’d also realized that the process kept adding entries to the repository’s config file to keep track of the empty directories, growing it to a size that Git itself could not handle. When I realized that, I decided to try the migration again, this time without the --preserve-empty-dirs flag. Holy moly ! From 3 weeks or more, the time it took to migrate the whole thing went down to 2 or 3 days ! (yes, really)
    So I ended up adding another little function to the script, which was able to look at the Subversion repository and recreate the empty directories in each branch of the Git repository, after the fact.
    The comparison scripts also had to account for this, simply by deleting the empty directories before diff’ing the SVN and Git branches and tags.

  • repository size. This is something we haven’t looked into yet. It is not often much of an issue with Subversion: when large files become a problem in a given project, you can often just move them away to another location, and checkouts/updates become “fast” again. With Git, unfortunately, as soon as a “big file” enters the history of the repo, it will impact the clone operation forever. Git’s filter-branch should help with this, and there are good examples in the Pro Git book, so I’m not too worried. The STK repository is probably going to be the first use-case for this.
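The verification pass mentioned above (comparing an SVN tag or branch with its Git counterpart while tolerating keyword expansion and empty directories) can be sketched roughly as follows. This is an illustration under assumptions, not the published script: the function names are made up, both trees are assumed to be already exported to local directories, and files are assumed to be text. Instead of deleting empty directories before diffing, this version simply compares regular files only, which has the same effect.

```shell
#!/usr/bin/env bash
# Sketch only: compare an exported SVN tree with a Git checkout.
# Function names are hypothetical.

# Stream filter: collapse expanded SVN keywords such as
# "$Id: Foo.java 42 2012-01-01 bob $" back to the bare "$Id$" form.
collapse_keywords() {
  LC_ALL=C sed -E 's/\$(Id|Rev|Revision|Date|Author|HeadURL|URL|LastChangedDate|LastChangedRevision|LastChangedBy)(:[^$]*)?\$/$\1$/g'
}

# Sorted relative paths of regular files, skipping VCS metadata.
# Empty directories never appear here, so they are ignored implicitly.
list_files() {
  (cd "$1" && find . -type f ! -path '*/.git/*' ! -path '*/.svn/*' | LC_ALL=C sort)
}

# Usage: same_tree <svn-export-dir> <git-checkout-dir>
# Returns 0 when both trees hold the same files with the same content,
# once keywords are collapsed; reports differences on stderr otherwise.
same_tree() {
  local a="$1"
  local b="$2"
  local f
  local status=0
  if [ "$(list_files "$a")" != "$(list_files "$b")" ]; then
    echo "file lists differ" >&2
    return 1
  fi
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    # Checksums keep the comparison simple; a real script might use diff.
    if [ "$(collapse_keywords < "$a/$f" | cksum)" != \
         "$(collapse_keywords < "$b/$f" | cksum)" ]; then
      echo "content differs: $f" >&2
      status=1
    fi
  done <<EOF
$(list_files "$a")
EOF
  return "$status"
}
```

Run in a loop over every branch and tag, something like `same_tree svn-export/tags/x git-checkout` gives a pass/fail answer per ref, which is what makes an unattended migration of many modules trustworthy.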

With that said, the migration is now pretty smooth. And in case you came to this article in the hope of finding out how we migrated, hey, I just published the script in question ! It’s a little tied to our infrastructure, but it shouldn’t be too hard to re-use or adapt. It’s fairly well documented (i.e. I stopped and wrote a comment every time I was forced to use one of Bash’s arcane idioms or bits of syntax), and I’ll happily accept pull requests to make it more flexible !

I started writing those scripts in Bash, naively thinking they’d be rather trivial. I ended up learning a lot more than I expected (and wanted !) about Bash scripting. And I cursed myself several times along the way, thinking it’d be the last time I’d write something non-trivial in Bash. But I couldn’t bring myself to start all over again, so I sucked it up, and this is the result.

Well, if you’ve been following our forums or user-list a little, you’ll know that by now, all our public projects are migrated ! Next up, we’re planning to give Forge projects the option to migrate (and to create new projects on Git of course), and we also need to migrate our /build stuff. I’ll probably take that opportunity to migrate and publish some more of our internal tools. I’ve been cooking a bunch of Groovy scripts that deal with all sorts of things – nothing revolutionary, but it could be interesting to get some feedback about those too.

But this isn’t the end of the story. Git is more than just another revision control system. It forces different behaviors, hopefully for the better ! Developers who have been using Subversion for years (decades!) are sometimes disgruntled, because it is hard to change your habits; it is hard to sit down and open a book, think about what you really want to do, change the way your brain is wired and learn new things, when you have to push a release out before leaving the office. It is going to be difficult, and it is going to take a long time, but in the long run, I hope we’ll all see the benefits of this migration.

In another post, I will try to explain why we stuck with our own infrastructure and why we chose the tool(s) we chose, when GitHub exists and most people will claim (in part rightfully) that maintaining your own infrastructure in 2012 is crazy ! I’ll also try to explain how our repositories are mirrored on GitHub (because hopefully by then they will be !), and how we can benefit from GitHub’s tools.

In the meantime, clone our Git repositories, and get your pull requests warm and ready !

For information about our repositories, see the following two links:

2 thoughts on “Git at Magnolia – now a reality !

  1. Pingback: Git at Magnolia – now a reality ! | CMS Radar

  2. Woohoo, congrats! I know this has been a long road (especially if you include the pre-December discussions and planning), and I’m delighted to see this day come at last. I’ve come to like git so much that I now use it in front of Subversion wherever I’m still having to work with code in that system.

    Anyway, well done on the migration and the info-sharing.
