Amila Senadheera

Tech enthusiast

Unravelling "git commit"


Published on September 27, 2023

I assume that everyone is familiar with Git (the tool) and GitHub (a hosted service for Git repositories), or will encounter them at some point in their career. Git is a distributed version control system, unlike SVN, which is centralized. You've probably run the git commit command to record a set of related file changes. Maybe your favorite IDE makes it even easier with a few button clicks and a UI input for the commit message. However, it's interesting to explore what Git actually does when you run it.

The git commit command is one of Git's 'Main Porcelain Commands': a high-level command recommended for day-to-day use. Let's use some low-level (plumbing) Git commands to achieve essentially the same result and better understand the inner workings of Git.

It might seem like the Git repository is just a copy of your file system, but it's not. Your working directory is an ordinary file tree, but the way Git stores your files and changes inside the repository is very different. Is it a B-tree or an LSM tree as in database systems? Not at all.

Let's initialize a repo:

git init -b main ./my-new-repo

This command creates an empty directory called my-new-repo and initializes a Git repository. Git creates a hidden folder called .git where all the version control changes are stored. You continue working in the directory, and from time to time, you combine your changes and store them in the repository folder. So, .git is the crucial part. Let's take a closer look at its organization.

cd my-new-repo
tree .git

Output:

.git
├── HEAD
├── config
├── description
├── hooks
│   ├── ...
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

This is what an empty initialized repository looks like. The objects directory is where your changes go. Let's add a file:

echo "My version controlled file" > my-file

If you check the directory tree again as we did above, there are no changes yet.

git add

To get the file into the object store, we first need to stage it:

git add my-file

Let's check what happened to .git:

├── objects
│   ├── 0a
│   │   └── d2eaec86211bab157fb4ca0f2bcd099099b660
│   ├── info
│   └── pack

You should see exactly the same object name if you add the same file content. Git has stored the content of my-file in its object store (the .git/objects directory) as a blob (Binary Large Object).

0ad2eaec86211bab157fb4ca0f2bcd099099b660 is the SHA-1 hash of the file content, prefixed with a short header that records the object type and size. Git doesn't care about the file name when storing a blob; the name only matters when the file is placed in your working directory. Also, note that Git uses the first two characters of the hash (0a) as a directory name and the remaining 38 characters as the file name of the object. This spreads objects across directories so that a large number of them can be stored efficiently.
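
To make the hashing step concrete, here is a minimal Python sketch of how a blob's object ID can be derived. The function names git_blob_hash and loose_object_path are made up for this illustration and are not part of Git. If your file is byte-for-byte identical to the one above (note that echo appends a newline), the printed ID should match the object name in the listing.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return where a loose object with this ID lives inside the repository."""
    # First two hex characters: directory name. Remaining 38: file name.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# `echo` adds a trailing newline, so it is part of the stored content.
content = b"My version controlled file\n"
object_id = git_blob_hash(content)
print(object_id)
print(loose_object_path(object_id))
```

The file name never enters the calculation, which is why two files with the same content end up with the same object ID.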

Let's check the object type:

git cat-file -t 0ad2eaec
blob

We used the first few characters of the SHA1 hash in the above command.

What's in it:

git cat-file -p 0ad2eaec
My version controlled file
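
What git cat-file does for a loose object can be approximated in a few lines of Python: the file on disk is the zlib-compressed header plus content. The sketch below is a simplified assumption of that flow, not Git's implementation. The helper names write_loose_object and read_loose_object are hypothetical, it works in a temporary directory rather than a real .git folder, and it needs the full 40-character ID (Git itself also resolves abbreviations).

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(objects_dir: Path, obj_type: str, payload: bytes) -> str:
    """Store an object the way a loose Git object is laid out on disk."""
    raw = f"{obj_type} {len(payload)}".encode("ascii") + b"\0" + payload
    object_id = hashlib.sha1(raw).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))  # loose objects are zlib-compressed
    return object_id


def read_loose_object(objects_dir: Path, object_id: str) -> Tuple[str, bytes]:
    """Return (type, payload), roughly what `cat-file -t` and `-p` report."""
    path = objects_dir / object_id[:2] / object_id[2:]
    raw = zlib.decompress(path.read_bytes())
    header, _, payload = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    oid = write_loose_object(objects_dir, "blob", b"My version controlled file\n")
    obj_type, payload = read_loose_object(objects_dir, oid)
    print(obj_type)
    print(payload.decode("utf-8"), end="")
```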

git write-tree

So, we've successfully stored a blob object in the store. However, this is not enough to track the changes: the blob records neither the file name nor where the file lives. To store the directory tree, you need to run:

git write-tree
e0aaef45a7b278c599211c06428d53d3d128749a

You will get exactly the same output, because a tree's hash depends only on its content (modes, file names, and object hashes), and those are the same for everyone following along.

Let's check the object store again:

├── objects
│   ├── 0a
│   │   └── d2eaec86211bab157fb4ca0f2bcd099099b660
│   ├── e0
│   │   └── aaef45a7b278c599211c06428d53d3d128749a
│   ├── info
│   └── pack

Now we have one more object in the object store.

Let's check what the object type is:

git cat-file -t e0aaef45
tree

(We used the first few characters of the SHA1 hash in the above command.)

What's in it:

git cat-file -p e0aaef45
100644 blob 0ad2eaec86211bab157fb4ca0f2bcd099099b660	my-file

It contains the content of the root directory as a list of records. You can see that it refers to the SHA-1 of my-file's content that we stored previously. The format is very simple. 100644 is the mode, which here denotes a normal (non-executable) file. blob is the type of the object the record points to. Next is the hash, and the file name my-file goes at the end. A tree object only stores information for a single directory level: it records a blob for each file in that directory and a tree object for each directory within it.
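
The listing printed by git cat-file -p is a human-readable rendering. On disk, each record is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the hash (the object type is implied by the mode). The Python sketch below builds such a tree under simplifying assumptions: it only handles plain files and sorts entries by name, while real Git has extra ordering rules for subdirectories. The names object_id and tree_payload are hypothetical. If the bytes match what Git wrote, the printed ID should equal the tree hash above.

```python
import hashlib
from typing import List, Tuple


def object_id(obj_type: str, payload: bytes) -> str:
    """Hash "<type> <size>" + NUL + payload, as Git does for every object."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries: List[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex object ID) records for one directory level."""
    out = b""
    # Simplification: plain sort by name; fine for a directory of files only.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        out += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return out


# The single record from the listing above.
entries = [("100644", "my-file", "0ad2eaec86211bab157fb4ca0f2bcd099099b660")]
payload = tree_payload(entries)
print(object_id("tree", payload))
```

Because the blob hash is embedded in the tree's bytes, any change to a file's content or name changes the tree's hash as well.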

Okay, great! We've encountered two types of objects so far: blob and tree. Note that a tree object contains the names and hashes of everything in its directory, so the hash of the tree depends on all of them. That means if you rename the file or change its content, a new tree object has to be generated.

Also, note that if you store the same file content under two names, it will be stored only once in the object store: the content is the same, so both tree entries point to the same blob. But what if you change a single character of the file? Will it generate a new blob? Yes, it will. That sounds inefficient, right? It does. However, Git compresses the objects it stores losslessly, so this is more efficient than it sounds. Let's stay on track with the title of this article.
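
Here is a toy content-addressable store in Python that shows both effects. It is only an in-memory sketch: the ObjectStore class is invented for this example and keeps objects in a dictionary instead of the .git/objects directory.

```python
import hashlib
import zlib


class ObjectStore:
    """A toy, in-memory stand-in for Git's object store (illustration only)."""

    def __init__(self) -> None:
        self.objects = {}  # object ID -> compressed bytes

    def put_blob(self, content: bytes) -> str:
        raw = b"blob %d\0" % len(content) + content
        oid = hashlib.sha1(raw).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(oid, zlib.compress(raw))
        return oid


store = ObjectStore()
first = store.put_blob(b"My version controlled file\n")   # e.g. my-file
second = store.put_blob(b"My version controlled file\n")  # same content, other name
changed = store.put_blob(b"My version controlled File\n")  # one character differs

print(first == second)     # same content gives the same object
print(first == changed)    # a one-character change gives a new object
print(len(store.objects))  # number of distinct objects stored
```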

git commit-tree

Now we have created a tree object for the root directory. A commit is just another object type stored in Git's object store. You might have already guessed that it holds a reference to the tree object at the root level of the directory. Let's create it:

git commit-tree e0aaef45a7b278c599211c06428d53d3d128749a -m "Adding version controlled file"
9e14e44d7dd72f8eeb12e4a6c8305e0cda7619af

This commit hash will be different for you because the commit's content includes the author (who wrote the change), the committer (who put it into the repo), and the timestamps of those actions.

So, there are now three objects in total:

├── objects
│   ├── 0a
│   │   └── d2eaec86211bab157fb4ca0f2bcd099099b660
│   ├── 9e
│   │   └── 14e44d7dd72f8eeb12e4a6c8305e0cda7619af
│   ├── e0
│   │   └── aaef45a7b278c599211c06428d53d3d128749a
│   ├── info
│   └── pack

Let's confirm its type:

git cat-file -t 9e14e44d7dd
commit

What's inside the commit object:

git cat-file -p 9e14e44d7dd
tree e0aaef45a7b278c599211c06428d53d3d128749a
author amila <amila.15@cse.mrt.ac.lk> 1695832070 +0530
committer amila <amila.15@cse.mrt.ac.lk> 1695832070 +0530

Adding version controlled file
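
A commit object is plain text, so it is easy to rebuild by hand. The Python sketch below assembles the same fields shown above and hashes them with a commit header. The function names are hypothetical, and the layout is my reading of the output above (tree, optional parents, author, committer, a blank line, then the message with a trailing newline). It also shows why your hash differs from mine: changing only the timestamp produces a different ID.

```python
import hashlib
from typing import Sequence


def commit_payload(tree_id: str, author: str, committer: str,
                   message: str, parents: Sequence[str] = ()) -> bytes:
    """Build the text body of a commit object."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parents]  # none for a first commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def commit_id(payload: bytes) -> str:
    """Hash "commit <size>" + NUL + payload."""
    header = f"commit {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


tree = "e0aaef45a7b278c599211c06428d53d3d128749a"
who = "amila <amila.15@cse.mrt.ac.lk>"
message = "Adding version controlled file"

original = commit_payload(tree, f"{who} 1695832070 +0530",
                          f"{who} 1695832070 +0530", message)
one_second_later = commit_payload(tree, f"{who} 1695832071 +0530",
                                  f"{who} 1695832071 +0530", message)

print(commit_id(original))
print(commit_id(one_second_later))  # differs although tree and message are equal
```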

We can depict all the objects we have in the object store as follows:

  • Blob objects are represented using squares.
  • Tree objects are represented using triangles.
  • Commit objects are represented using circles.

[Diagram: commit 9e14e44d (author, committer) points to tree e0aaef45, which lists blob 0ad2eaec ("My version controlled file") under the name my-file.]

git commit

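Now it's clear that git write-tree followed by git commit-tree produces the same commit object that git commit creates. The one remaining step git commit performs is moving the current branch to the new commit, which you can do by hand with git update-ref.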

Let's go one step further. Copy my-file into a directory called copied-files, then repeat the same commands as above (git add, git write-tree, and git commit-tree). The working directory now looks like this:

.
├── copied-files
│   └── my-file
└── my-file

Now the object store has five objects in total:

├── objects
│   ├── 0a
│   │   └── d2eaec86211bab157fb4ca0f2bcd099099b660
│   ├── 5a
│   │   └── 00e80086517be6481d428cc13c0d11ad3d3791
│   ├── 9e
│   │   └── 14e44d7dd72f8eeb12e4a6c8305e0cda7619af
│   ├── b6
│   │   └── e45dfc14712993a11554a154dc94bb8caa3cb3
│   ├── e0
│   │   └── aaef45a7b278c599211c06428d53d3d128749a
│   ├── info
│   └── pack

Then you'll have an object store with objects linked as shown below:

[Diagram: commit 9e14e44d points to tree e0aaef45, which lists blob 0ad2eaec as my-file. Commit 5a00e80 points to tree b6e45dfc, which lists blob 0ad2eaec as my-file and tree e0aaef45 as copied-files.]

Notice that the second commit has a new root tree object, which reuses the existing objects: it points to the previous tree (which now represents the content of the new copied-files directory) and to the same blob. The entire structure is a Directed Acyclic Graph (DAG), which means you can arrange all the commit, tree, and blob objects in a line so that every reference points in the same direction. Starting from a commit and traversing its sub-graph, you can recreate the working directory file tree as of that commit. If you start from the second commit, you will reach the same blob twice, since the file is duplicated.
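
The traversal can be sketched in Python with an in-memory model of the five objects. This is an illustration only: the dictionary layout and the checkout function are invented here, the IDs are the abbreviated hashes from the listings above, and real Git reads and decompresses each object from the store instead.

```python
from typing import Dict

# Abbreviated object IDs from the article, modelled as plain Python values.
objects = {
    "0ad2eaec": ("blob", b"My version controlled file\n"),
    "e0aaef45": ("tree", [("my-file", "0ad2eaec")]),
    "b6e45dfc": ("tree", [("copied-files", "e0aaef45"), ("my-file", "0ad2eaec")]),
    "9e14e44d": ("commit", "e0aaef45"),  # first commit -> root tree
    "5a00e800": ("commit", "b6e45dfc"),  # second commit -> new root tree
}


def checkout(store: dict, commit: str) -> Dict[str, bytes]:
    """Walk commit -> tree -> blobs and return {path: file content}."""
    kind, root_tree = store[commit]
    if kind != "commit":
        raise ValueError(f"{commit} is a {kind}, not a commit")
    files: Dict[str, bytes] = {}

    def walk(tree_id: str, prefix: str) -> None:
        _, entries = store[tree_id]
        for name, child_id in entries:
            child_kind, child = store[child_id]
            if child_kind == "tree":
                walk(child_id, prefix + name + "/")  # descend one level
            else:
                files[prefix + name] = child  # a blob: the file content

    walk(root_tree, "")
    return files


for path in sorted(checkout(objects, "5a00e800")):
    print(path)
```

Both paths in the second commit resolve to the same blob, which is the reuse described above.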

In fact, a Git branch or lightweight tag is just a pointer to a commit object: typically a small file under .git/refs that holds the commit's hash. So, creating a new branch is a very inexpensive operation.
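
As a rough illustration of how cheap that is, the Python sketch below mimics the layout in a temporary directory: a branch is a file containing a commit hash, and HEAD names the current branch. The helpers create_branch and resolve_head are hypothetical, and the sketch ignores details such as packed refs that a real repository may use.

```python
import tempfile
from pathlib import Path


def create_branch(git_dir: Path, name: str, commit_id: str) -> None:
    """Creating a branch amounts to writing one small file."""
    ref = git_dir / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(commit_id + "\n")


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to the commit it currently points at."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a branch; the branch file holds the commit hash.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head  # detached HEAD: the hash is stored directly


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    git_dir.mkdir()
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    create_branch(git_dir, "main", "9e14e44d7dd72f8eeb12e4a6c8305e0cda7619af")
    resolved = resolve_head(git_dir)
    print(resolved)
```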

Nice, now Git commit doesn't have to be something that keeps you awake at night!

Happy Learning!

© 2024 Developer Diary.