nblock's ~

Recover most of a broken Git repository

WARNING: The steps in this blog post deliberately rewrite the history of a Git repository and discard some of the information stored in it. If you are fortunate enough to have a backup somewhere, refer to the excellent write-up from Linus Torvalds on how to recover broken blob objects instead.

Preface

A colleague approached me and said:

I have this weird Git repository that I can commit to and work with, but I can't push it to a remote Git server. The Git tools complain about damaged objects in the repository.

After a brief discussion, it turned out that this is the only available copy of the repository. There are no off-site backups and no remote repositories available. Furthermore, about a year ago, the hard drive where this Git repository used to reside crashed and a lot of data was lost.

This might get interesting.

Investigating the current state

The colleague gave me a copy of the Git repository and I started the investigation. Git allows one to check the current state of the repository using git fsck:

$ cd working
$ git fsck
error: inflate: data stream error (invalid distance too far back)
error: sha1 mismatch 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
error: 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d: object corrupt or missing
Checking object directories: 100% (256/256), done.
missing blob 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
...
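
For the later steps it helps to have the affected object IDs as a plain list. The following is a minimal sketch and not part of the original workflow: the function name list_missing_objects is made up for illustration, and it only understands the "missing <type> <hash>" lines shown in the output above.

```shell
# Hypothetical helper: reads `git fsck` output on stdin and prints
# "<type> <hash>" once for every "missing <type> <hash>" line.
list_missing_objects() {
    awk '$1 == "missing" && NF == 3 { print $2, $3 }' | sort -u
}

# Usage inside the repository (both streams are merged to be safe):
#   git fsck 2>&1 | list_missing_objects
```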

Ouch. The man page of git fsck states the following about sha1 mismatches:

The database has an object whose sha1 doesn't match the database value. This indicates a serious data integrity problem.

Fix (most of) the repository

Let's move the broken file somewhere else and run git fsck again:

$ mv .git/objects/25/c49e20b0c3eca36713a9cb7a21b25a172f7b0d /tmp
$ git fsck
Checking object directories: 100% (256/256), done.
broken link from    tree ef557adecb5ed7e114d93a7a9a82cbf4b0cd30f1
              to    blob 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
missing blob 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
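
The path used in the mv command above follows from how Git stores loose objects: the first two hex digits of the hash name a directory below .git/objects, and the remaining 38 digits name the file. A small sketch of that mapping, with a made-up function name and assuming the object is stored loose rather than inside a pack file:

```shell
# Hypothetical helper: print the loose-object path for a full SHA-1 hash.
# Packed objects live in .git/objects/pack and are not covered here.
loose_object_path() {
    hash=$1
    rest=${hash#??}          # hash without its first two characters
    dir=${hash%"$rest"}      # the first two characters only
    printf '.git/objects/%s/%s\n' "$dir" "$rest"
}

loose_object_path 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
# prints .git/objects/25/c49e20b0c3eca36713a9cb7a21b25a172f7b0d
```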

Now we know which tree object is affected. Fortunately, the damage seems fairly limited, as only one tree object refers to the broken blob object. So far we only know the hash of the broken blob object, not the actual filename. The tree object hash can be used to find out which file in the repository the broken blob object belongs to:

$ git ls-tree ef557adecb5ed7e114d93a7a9a82cbf4b0cd30f1 | \
    grep 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
100644 blob 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d    main.c
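
Note that git ls-tree without -r only lists the entries of that one tree, so the name it prints is relative to the tree. If the damaged blob sat in a subdirectory, or was referenced by several commits, a recursive search over all commits gives the full picture. The following is a sketch under those assumptions; find_blob_paths is a hypothetical name, and the loop only reads commit and tree objects, never the blob itself.

```shell
# Hypothetical helper: print "<commit> <path>" for every commit reachable
# from any ref whose tree contains the given blob hash.
find_blob_paths() {
    blob=$1
    git rev-list --all | while read -r commit; do
        # ls-tree -r prints "<mode> <type> <hash><TAB><path>" per entry
        git ls-tree -r "$commit" |
            awk -v c="$commit" -v b="$blob" -F '\t' '
                { split($1, meta, " ") }
                meta[3] == b { print c, $2 }'
    done
}

# Usage:
#   find_blob_paths 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d
```

On a large repository this is slow, because every commit's tree is listed in full.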

OK, main.c is broken, but which commit points to the tree object? git log to the rescue:

$ git log --pretty=format:"%T %H" | grep ef557adecb5ed7e114d93a7a9a82cbf4b0cd30f1
ef557adecb5ed7e114d93a7a9a82cbf4b0cd30f1 e0a7722fd9aa94f632ad6427d62189a3ae2b8de5

The second column is the hash of the faulty commit: e0a7722fd9aa94f632ad6427d62189a3ae2b8de5. This commit is more than a year old, so it seems likely that the blob object got damaged during the hard drive crash. Let's find the commits before and after the faulty commit and take a look at the differences between them:

$ git log --pretty=format:"%H - %h - %s"
...
85263b18e04e7dc0473f4c4501d366d389bf6e01 - 85263b1 - [redacted]
b87625ab801db7d3452746ca8cc2a1f4137ed924 - b87625a - [commit after, redacted]
e0a7722fd9aa94f632ad6427d62189a3ae2b8de5 - e0a7722 - [faulty commit, redacted]
4fb7737350cc0b646177cb0a041fc73422ffc98a - 4fb7737 - [commit before, redacted]
75199fa2dab382cbf3395be0a696e72e884163b0 - 75199fa - [redacted]
...

$ git diff --shortstat 4fb7737..b87625a
9 files changed, 1444 insertions(+), 1056 deletions(-)

Oh, that's a lot! How about the diff for main.c only?

$ git diff --shortstat 4fb7737..b87625a -- main.c
1 file changed, 977 insertions(+), 857 deletions(-)

This does not look any better. As there is no other backup of main.c from this particular point in time, we might just kick out the faulty commit entirely. After all, having a commit where an important file is missing is not helpful either. Even worse, the project can't be built from this commit at all. In this particular situation, we can get rid of the faulty commit and "fix" the repository that way.
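
Had an old copy of main.c turned up somewhere, it would have been worth checking whether it is the lost version before giving up on the commit. A blob's ID depends only on the file content, so git hash-object can tell. This is a sketch with a made-up helper name, and the candidate path in the usage comment is purely illustrative:

```shell
# Hypothetical helper: succeed if FILE hashes to the given blob ID.
matches_blob() {
    [ "$(git hash-object "$1")" = "$2" ]
}

# Usage (candidate path is an assumption for illustration):
#   matches_blob /path/to/old/main.c 25c49e20b0c3eca36713a9cb7a21b25a172f7b0d &&
#       git hash-object -w /path/to/old/main.c   # -w writes the blob into the object database
```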

One might think of interactive rebasing in this situation, but it turns out that it does not work due to the broken object database. Another way to solve this problem is to use graft points (grafts). The description from the Git wiki sounds promising:

It works by letting users record fake ancestry information for commits. This way you can make Git pretend the set of parents a commit has is different from what was recorded when the commit was created.

Currently, the Git repository looks like this:

x --- 75199fa --- 4fb7737 --- e0a7722 --- b87625a --- 85263b1 --- x
                  [before]    [faulty]    [after]

And it should look like this:

                       /------------------\
                      /                    \
x --- 75199fa --- 4fb7737     e0a7722     b87625a --- 85263b1 --- x
                  [before]    [faulty]    [after]

Let's assign the commit b87625a its new parent commit 4fb7737 using the grafts file:

$ mkdir -p .git/info
$ echo "b87625ab801db7d3452746ca8cc2a1f4137ed924 4fb7737350cc0b646177cb0a041fc73422ffc98a" \
    > .git/info/grafts
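
Newer Git versions mark the grafts file as deprecated in favour of git replace. As a hedged alternative to the step above, the same fake parent can be recorded with git replace --graft, which stores a replacement ref under refs/replace/ instead of writing .git/info/grafts. The wrapper name graft_parent is made up for this sketch; according to the git filter-branch documentation, replacement refs are made permanent by the rewrite in the same way as grafts.

```shell
# Sketch: make NEW_PARENT the only parent of CHILD via a replacement ref.
# graft_parent is a hypothetical wrapper around `git replace --graft`.
graft_parent() {
    child=$1
    new_parent=$2
    git replace --graft "$child" "$new_parent"
}

# Usage with the hashes from this post:
#   graft_parent b87625ab801db7d3452746ca8cc2a1f4137ed924 \
#                4fb7737350cc0b646177cb0a041fc73422ffc98a
```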

Apply the changes permanently:

$ git filter-branch -- --all

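This Git repository is now somewhat dirty: git filter-branch keeps a backup of the original refs under refs/original/, and the old objects are still in the object database. We would like to have a clean copy of it. The easiest way to accomplish this is to simply clone the repository locally (see the man page of git filter-branch):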

$ cd ..
$ git clone working clean
Cloning into 'clean'...
done.
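
One caveat, as far as I understand the git clone documentation: cloning from a plain local path may simply copy or hard-link everything under objects, including objects that are no longer reachable. The git filter-branch man page therefore suggests cloning through a file:// URL; the --no-local option should have the same effect for a path. A minimal sketch, with clone_reachable_only being a made-up name:

```shell
# Sketch: clone through Git's regular transport so that only objects
# reachable from the refs are transferred into the new repository.
clone_reachable_only() {
    src=$1
    dest=$2
    git clone --no-local "$src" "$dest"
}

# Usage from the parent directory:
#   clone_reachable_only working clean
```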

Looks good, but what about git fsck and the history?

$ cd clean
$ git fsck
Checking object directories: 100% (256/256), done.

$ git log --pretty=format:"%H - %h - %s"
...
161ef4045da4cc6750599a11447380a76ca017b6 - 161ef40 - [redacted (new hash)]
c56d22e9da16aa59fa5fb47d11e4c2c930d1b583 - c56d22e - [commit after, redacted (new hash)]
4fb7737350cc0b646177cb0a041fc73422ffc98a - 4fb7737 - [commit before, redacted]
75199fa2dab382cbf3395be0a696e72e884163b0 - 75199fa - [redacted]
...
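
Instead of reading through the log by eye, Git can also be asked directly whether the faulty commit is still part of the history. The sketch below uses git merge-base --is-ancestor; the helper name still_in_history is hypothetical, and a commit that does not exist in the repository at all is treated the same as one that is not an ancestor.

```shell
# Hypothetical helper: succeed if COMMIT is an ancestor of REF (default: HEAD).
still_in_history() {
    git merge-base --is-ancestor "$1" "${2:-HEAD}" 2>/dev/null
}

# Usage in the clean clone:
#   still_in_history e0a7722fd9aa94f632ad6427d62189a3ae2b8de5 ||
#       echo "faulty commit is gone"
```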

Done. The Git object database is in a consistent state again, and the repository can be pushed to a remote Git server.

Credits

  • Lukas for reviewing this blog post and his valuable input.

Go and create off-site backups!

