Recover most of a broken Git repository
written on Monday, October 24, 2016
WARNING: The steps in this blog post deliberately modify the history of a Git repository and even get rid of some information in the repository. If you are fortunate enough to have a backup somewhere, refer to the excellent write-up from Linus Torvalds on how to recover broken blob objects instead.
Preface
A colleague approached me and said:
I have this weird Git repository which I can commit to and work with, but I can't push it to a remote Git server. The Git tools complain about damaged objects in the repository.
After a brief discussion, it turned out that this is the only available copy of the repository. There are no off-site backups and no remote repositories available. Furthermore, about a year ago, the hard drive where this Git repository used to reside crashed and a lot of data was lost.
This might get interesting.
Investigating the current state
The colleague gave me a copy of the Git repository and I started the investigation. Git allows one to check the current state of the repository using git fsck:
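A minimal sketch of that check, wrapped in a small helper. The name check_objects is made up for illustration, and --full is spelled out only for clarity, as it has been the default for a long time:

```shell
# Hypothetical helper (not from the original post):
# verify the object database of the repository at path $1.
check_objects() {
    # fsck exits with a non-zero status when it finds errors
    # such as corrupt or missing objects.
    git -C "$1" fsck --full
}
```

Calling check_objects /path/to/repo prints the findings and lets a script react to the exit status.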
Ouch. The man page of git fsck states the following about sha1 mismatches:
The database has an object who’s sha1 doesn't match the database value. This indicates a serious data integrity problem.
Fix (most of) the repository
Let's move the broken file somewhere else and run git fsck again:
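Roughly, moving the file could look like the sketch below. It assumes a non-bare repository and that the broken object is stored as a loose file (not inside a packfile). The helper name quarantine_object and the destination directory are placeholders of mine:

```shell
# Hypothetical helper (not from the original post): move the loose object
# file for hash $2 out of the repository at $1 into directory $3.
quarantine_object() (
    repo=$1; hash=$2; dest=$3
    prefix=$(printf '%s' "$hash" | cut -c1-2)   # first two hex digits: directory
    rest=$(printf '%s' "$hash" | cut -c3-)      # remaining digits: file name
    mkdir -p "$dest" &&
    mv "$repo/.git/objects/$prefix/$rest" "$dest/$hash"
)
```

With the file out of the way, git fsck no longer trips over the corrupt data. It typically reports the blob as missing and names the tree that still links to it.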
Now we know which tree object is affected. Fortunately, the damage seems fairly limited, as only one tree object refers to the broken blob object. So far we only know the hash of the broken blob object, not the actual filename. The tree object hash can be used to find out which file in the repository the broken blob object belongs to:
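One way to do this lookup is git ls-tree, which only needs to read the tree object and therefore still works while the blob is missing. A sketch, with find_blob_name being a name I made up and both hashes taken from the git fsck report:

```shell
# Hypothetical helper (not from the original post): print the entry of
# tree $2 that points at blob $3 in the repository at $1.
find_blob_name() {
    # Each ls-tree line has the form: <mode> <type> <hash><TAB><file name>
    git -C "$1" ls-tree "$2" | grep "$3"
}
```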
OK, main.c is broken, but which commit points to the tree object? git log to the rescue:
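A sketch of such a search: print the tree hash and the commit hash of every commit and filter for the tree in question. This only matches when the tree is the top-level tree of a commit, which is assumed here. The helper name find_commit_for_tree is hypothetical:

```shell
# Hypothetical helper (not from the original post): list commits in the
# repository at $1 whose top-level tree is $2 (full hash).
find_commit_for_tree() {
    # Columns: %T = tree hash, %H = commit hash, %s = subject line
    git -C "$1" log --all --format='%T %H %s' | grep "^$2"
}
```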
The 2nd column is the hash of the faulty commit: e0a7722fd9aa94f632ad6427d62189a3ae2b8de5. This commit is more than a year old. It really seems that the blob object got damaged during the hard drive crash. Let's find the commit before and after the faulty commit and take a look at the differences between them:
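Finding the neighbours can be sketched as follows. The parent is simply the faulty commit with a ^ appended, and the children can be read from git rev-list --children. The helper name show_neighbours is made up:

```shell
# Hypothetical helper (not from the original post): print the first parent
# and the children of commit $2 in the repository at $1.
show_neighbours() {
    echo "before: $(git -C "$1" rev-parse "$2^")"
    # rev-list --children prints "<commit> <child> <child> ..." per line;
    # keep the line for our commit and strip its own hash.
    git -C "$1" rev-list --children --all | grep "^$2" | sed 's/^[^ ]* */after: /'
}
```

For the repository in this post, the call would be show_neighbours . e0a7722fd9aa94f632ad6427d62189a3ae2b8de5.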
Oh, that's a lot! How about the diff for main.c only?
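Both comparisons can be sketched with one small helper: a per-file summary when no path is given, and the full diff for a single path otherwise. The name diff_around is hypothetical, and the commits passed in are the ones before and after the faulty commit:

```shell
# Hypothetical helper (not from the original post): compare commits $2 and $3
# in the repository at $1. With a path in $4 the full diff for that path is
# shown, without it only a per-file summary.
diff_around() {
    if [ -n "${4:-}" ]; then
        git -C "$1" diff "$2" "$3" -- "$4"
    else
        git -C "$1" diff --stat "$2" "$3"
    fi
}
```

Neither call touches the faulty commit itself, so the broken blob is never read.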
This does not look any better. As there is no other backup of main.c from this particular point in time, we might just kick out the faulty commit entirely. After all, having a commit where an important file is missing is not helpful either. Even worse, the project can't be built from this commit at all. In this particular situation, we can get rid of the faulty commit and "fix" the repository that way.
One might think of interactive rebasing in this situation, but it turns out that it does not work due to the broken object database. Another way to solve this problem is to use graft points (grafts). The description from the Git wiki sounds promising:
It works by letting users record fake ancestry information for commits. This way you can make Git pretend the set of parents a commit has is different from what was recorded when the commit was created.
Currently, the Git repository looks like this:
x --- 75199fa --- 4fb7737 --- e0a7722 --- b87625a --- 85263b1 --- x
                  [before]    [faulty]    [after]
And it should look like this:
                       /------------------\
                      /                    \
x --- 75199fa --- 4fb7737     e0a7722     b87625a --- 85263b1 --- x
                  [before]    [faulty]    [after]
Let's assign the commit b87625a its new parent commit 4fb7737 using the grafts file:
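A sketch of writing that entry, assuming a non-bare repository. Each line of .git/info/grafts consists of a commit followed by its pretended parents, all as full object names, so abbreviated hashes are resolved first. The helper name graft_parent is made up. Newer Git versions print a deprecation hint for the grafts file and suggest git replace --graft instead:

```shell
# Hypothetical helper (not from the original post): record in the repository
# at $1 that commit $2 should be treated as having $3 as its only parent.
graft_parent() (
    repo=$1
    # The grafts file needs full object names, so resolve abbreviations first.
    child=$(git -C "$repo" rev-parse --verify "$2^{commit}") || exit 1
    parent=$(git -C "$repo" rev-parse --verify "$3^{commit}") || exit 1
    mkdir -p "$repo/.git/info"
    # One line per graft: <commit> <parent> [<parent> ...]
    echo "$child $parent" >> "$repo/.git/info/grafts"
)
```

For this repository the call would be graft_parent . b87625a 4fb7737.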
Apply the changes permanently:
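According to its man page, git filter-branch honors the grafts file and makes any grafts permanent when it rewrites history. A minimal sketch, run without any filters so that only the ancestry changes. The helper name make_grafts_permanent is hypothetical:

```shell
# Hypothetical helper (not from the original post): rewrite all refs of the
# repository at $1 so that the grafted ancestry becomes real history.
make_grafts_permanent() {
    # FILTER_BRANCH_SQUELCH_WARNING=1 skips the warning and pause that
    # recent Git versions show before filter-branch starts.
    FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$1" filter-branch -- --all
}
```

The rewritten commits get new hashes, and filter-branch keeps the old refs as a backup under refs/original/.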
This Git repository is now somewhat dirty and we would like to have a clean copy of it. The easiest way to accomplish this is to simply clone the repository locally (see man page of git filter-branch):
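A sketch of such a clone. The file:// URL matters: it makes Git use the regular transport, which copies only the objects reachable from the cloned refs, instead of copying or hard-linking the whole object directory. The helper name clean_clone and the target directory are placeholders:

```shell
# Hypothetical helper (not from the original post): clone the repository
# at $1 into the new directory $2 through the regular transport.
clean_clone() (
    src=$(cd "$1" && pwd) || exit 1   # file:// URLs need an absolute path
    git clone "file://$src" "$2"
)
```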
Looks good, but what about git fsck and the history?
Done. The Git object database is in a consistent state again, and the repository can be pushed to a remote Git server.