I once migrated a large, old Subversion (SVN) repository to a new Git repository, including all of its version history. It was not an easy task, and I learned quite a few new things about both Subversion and Git along the way. This post describes how I did it.
Introduction and background
I will not dive into the details of how Subversion and Git work, or how they differ in usage and workflows. You can read some comparisons here and here.
For the rest of this blog post, I will use Git terms. If you are unsure about them, you might have a look at the explanations on GitHub’s glossary page.
The migration was done to better accommodate multiple parallel development projects, which required a better branch-merge flow. There would always be loads of tree conflicts (e.g. a file or folder was removed, moved or created) when merging even a few revisions between branches. And we just got tired of that.
We were also missing a way to separate commits that had been reviewed by another developer from those that had not, and to separate work in progress from finished work. We really needed to control what went onto the test server. So adopting a feature branch strategy, which Git actively promotes, was essential to an agile development process.
The last reason for migrating was to do a big cleanup of the repository. The old SVN repository contained a lot of different web solutions, more or less unrelated to each other: a so-called mono-repo. Different branches were created by different teams working on different solutions in the same repository. No one used the trunk (equivalent to the master branch in Git) anymore.
In short, these were the main issues:
- When someone downloaded the solution for one product, they would also download everything else.
- When branching or merging, everything else would also be carried along.
- There would be a lot of branches, where individual solutions would not be changed at all.
The main task was then to make a new repository for the specific product that we worked on, with all of its own version history.
It would have been easy, had we only needed one solution folder from the SVN repository. The Git command line tool actually has a somewhat easy process for migrating a single subfolder into a new Git repository (Atlassian has a guide to that).
However, this main solution (as far as our team was concerned) was dependent on a few solutions placed in their own solution folders in the repository. This meant that all of the SVN repository had to be migrated first, as a whole. Then the unneeded folders had to be not just deleted, but written "out of history" ("rewriting history" is actually a Git term), before it could be uploaded to a Git server.
The migration involved a lot of commands, as well as a lot of refinement to the process to prepare for a faster and more reliable migration, so the downtime would be minimal for the development teams.
My migration process – step by step
This section is about the steps necessary to migrate a repository with multiple folders, of which only a few should be kept in the new Git repository. All of the following commands can be executed in Git Bash for Windows.
Overall the process is like this:
- Draw up a list of correct user names.
- Make a full local Git repository from the SVN repository.
- Delete unneeded files and folders.
- Delete old or unneeded branches.
- Upload it all to a Git server.
Here it comes.
The first step is to extract the usernames of all the committers from the SVN repository by running the following command in the local working copy of one of the SVN branches. This is an optional step that can be skipped if it is not necessary or possible. But the migrated repository will look much nicer with names and email addresses of all committers.
svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors.txt
The command will output a file named authors.txt in the current directory. It will contain one line per committer in the following format.
username = username <username>
You will need to manually change the names on each line (or as many as you can gather the data on) into something like this.
username = firstname lastname <username@domain.tld>
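If you already have a list of people somewhere, part of this manual editing can be scripted. The following is only a minimal sketch: the lookup file and its `username;Full Name;email` format are assumptions made for illustration, not something SVN or Git provides. Usernames that are missing from the lookup are left unchanged, so you can still fix those by hand.

```shell
# fill_authors LOOKUP_FILE < authors.txt
# LOOKUP_FILE is a hypothetical file with one line per person:
#   username;Full Name;email@domain.tld
# Lines whose username is not in the lookup are passed through unchanged.
fill_authors() {
  awk -F ';' '
    # First input (the lookup file): remember name and email per username.
    NR == FNR { name[$1] = $2; mail[$1] = $3; next }
    # Second input (stdin): lines like "username = username <username>".
    {
      split($0, parts, " = ")
      user = parts[1]
      if (user in name) print user " = " name[user] " <" mail[user] ">"
      else print $0
    }
  ' "$1" -
}

# Hypothetical usage:
# fill_authors people.txt < authors.txt > authors.filled.txt
```

Review the result before using it with `--authors-file`, since `git svn` needs an entry for every committer it encounters.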
Now do a Git clone of the whole SVN repository. This command should be executed in the folder in which this temporary Git repository should be created. It will then create a new folder with the specified repository name.
git svn clone http://svn.server.test/repository/path/ --authors-file=path/to/authors.txt --prefix='svn/' --trunk=trunk/ --branches=branches/ NameOfNewGitRepository
If the whole repository is large, if certain commits are large, or if you just want to get smaller chunks at a time, you can instead limit the initial clone to a range of SVN revisions. After cloning this smaller number of revisions, you can fetch the remaining revisions from the SVN repository in batches.
git svn clone http://svn.server.test/repository/path/ --authors-file=path/to/authors.txt --prefix='svn/' --trunk=trunk/ --branches=branches/ NameOfNewGitRepository -r 0:1000
cd NameOfNewGitRepository
git svn fetch -r 1000:3000
git svn fetch -r 3000:HEAD
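For a long history, typing each batch by hand gets tedious. As a rough sketch, the revision ranges can be generated with a small helper. The function name, the batch size and the revision numbers below are illustrative assumptions. Consecutive ranges share their end points, just like the ranges in the commands above.

```shell
# revision_ranges FIRST LAST STEP
# Print consecutive "from:to" ranges covering FIRST..LAST in steps of STEP.
revision_ranges() {
  first=$1; last=$2; step=$3
  from=$first
  while [ "$from" -lt "$last" ]; do
    to=$((from + step))
    # Do not run past the last revision.
    [ "$to" -gt "$last" ] && to=$last
    echo "$from:$to"
    from=$to
  done
}

# Hypothetical usage inside the cloned repository, assuming the SVN
# repository has at least 5000 revisions:
# for range in $(revision_ranges 1000 5000 1000); do
#   git svn fetch -r "$range"
# done
# git svn fetch -r 5000:HEAD
```

If a batch fails halfway, you can rerun the loop starting from the range that failed.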
With all commits cloned and fetched from SVN, you can delete files or whole folders from the version history. I put the unneeded solution folders on the list of folders to delete. I also added the folders containing NuGet packages, node modules and bower components, as these can easily be downloaded again.
The tool used in the command is called BFG Repo-Cleaner. It is a much faster alternative to the official Git filter-branch command. I downloaded that tool from their website and saved it in the parent folder of the temporary repository.
Be advised that this tool looks for the specified files and folders recursively in all subfolders. So if you have files or folders with names reflecting other solutions, they will be removed too. It could be a service client class named after a solution in one of the folders you are deleting.
cd ..
"C:/Program Files (x86)/Java/jre1.8.0_131/bin/java" -Dfile.encoding=utf-8 -jar bfg.jar --delete-files "file_to_delete.txt" --delete-folders "{packages,node_modules,bower_components,.vs,OldSolutionFolder1,OldSolutionFolder2}" --no-blob-protection NameOfNewGitRepository
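Because the names are matched recursively in all subfolders, it can be worth listing which paths would be affected before you run the tool. The helper below is a sketch of my own, not part of BFG. It reads a list of paths and prints those that have a folder or file name exactly equal to one of the given names. It does not handle names containing spaces or glob patterns.

```shell
# name_hits NAME...
# Read paths on stdin and print those where any path component
# (folder or file name) is exactly equal to one of the given names.
name_hits() {
  awk -v names="$*" '
    BEGIN {
      n = split(names, list, " ")
      for (i = 1; i <= n; i++) wanted[list[i]] = 1
    }
    {
      m = split($0, comp, "/")
      for (i = 1; i <= m; i++) if (comp[i] in wanted) { print; break }
    }
  '
}

# Hypothetical usage inside the temporary repository: check every path
# that ever existed in any branch against two of the names to delete.
# git log --all --pretty=format: --name-only | sort -u | name_hits OldSolutionFolder1 packages
```

Any path in the output that you did not expect to see is a sign that the list of names needs adjusting.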
After deleting the files, you need to expire all references to the now deleted objects, and then make Git compress the repository.
cd NameOfNewGitRepository
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Now all the specified files and folders are removed from the entire repository history. However, the files still exist in the repository folder as "uncommitted files". They need to be deleted, and the easiest way to do that is a hard reset of the local repository. This command will make the files and folders reflect the latest commit of the current branch (the master branch).
git reset --hard
Another thing: the repository may contain a lot of branches, e.g. release branches or other old branches, which were never deleted or merged back into their parent branches. As they are usually not needed in the new Git repository, they can be deleted with the first command below. The second command once again expires the references and cleans up the unneeded objects.
git branch -d -r `git branch -r | grep -E '(dev-|rel-).*'`
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Now, there are a number of remote SVN branches in the repository. It is not possible to push a branch from one remote to another. So they will first have to be created as local branches. This command iterates through all branches that are prefixed with "svn", and adds corresponding local Git branches.
remote=svn ; for brname in `git branch -r | grep $remote | grep -v master | grep -v HEAD | awk '{gsub(/^[^\/]+\//,"",$1); print $1}'`; do git branch -f $brname refs/remotes/svn/$brname; done
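The one-liner is fairly dense, so here is the name handling pulled out into a small function. It is a sketch of the same idea, not a drop-in replacement: the function name is made up, and unlike the original it only keeps names that start with the given prefix.

```shell
# local_branch_names REMOTE
# Read `git branch -r` output on stdin (e.g. "  svn/feature-x") and print
# the local branch names to create, skipping master and HEAD entries.
local_branch_names() {
  sed 's/^[ *]*//' | grep "^$1/" | grep -v -e master -e HEAD | sed "s|^$1/||"
}

# Hypothetical usage inside the temporary repository:
# git branch -r | local_branch_names svn | while read -r name; do
#   git branch -f "$name" "refs/remotes/svn/$name"
# done
```

Running only the first part of the pipeline, without the loop, gives you a preview of the branches that would be created.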
When the branches were created locally, the trunk branch was also created locally. However, the trunk branch in SVN corresponds to the master branch in Git, which is already created. So the newly created trunk branch can be safely deleted.
git branch -d trunk
At this stage the repository should be ready to be uploaded (pushed) to a remote repository. Get the full URL to the repository at the Git server and execute this command (replace the URL part with the URL of your own repository).
git remote add origin git@gitserver.company.local:username/repositoryname.git
Finally, the branches can be pushed to the new remote repository. This single command pushes all local branches to the Git server.
git branch | sed "s/^[ *]*\(.*\)/\1:\1/g" | tr "\n" " " | xargs git push -v --tags --set-upstream origin
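To see what the push command actually receives, it helps to look at the first half of the pipeline on its own. The sketch below wraps that part in a function (the name is an assumption) that turns `git branch` output into `name:name` refspecs, each meaning "push local branch name to remote branch name".

```shell
# branch_refspecs
# Read `git branch` output on stdin and print one "name:name" refspec per
# branch, all on a single space-separated line.
branch_refspecs() {
  sed 's/^[ *]*\(.*\)/\1:\1/' | tr '\n' ' '
}

# Example with made-up branch names:
# printf '  dev-1\n* master\n' | branch_refspecs
# prints: dev-1:dev-1 master:master (followed by a trailing space)
```

`xargs` then appends these refspecs as arguments to `git push`, so every local branch is pushed in one go.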
With the repository pushed, you should be ready to work on the new repository. I recommend that you clone the new repository and do work on that, instead of using the one used for the migration.
Now you can look forward and enjoy the Git way of working. If you are unsure about how to use Git, Atlassian wrote a great article introducing some Git workflows.