Unravelling "git commit"
Published on September 27, 2023
I assume that everyone is familiar with Git (the tool) and GitHub (a hosted service for Git repositories), or will encounter them at some point in their career. Git is a distributed version control system, unlike SVN, which is centralized. You've probably executed the git commit command to record a set of related file changes. Maybe your favorite IDE makes it even easier with a few button clicks and a UI input for the commit message. However, it's interesting to explore what Git actually does when you run it.
The git commit command is a 'Main Porcelain Command': a high-level command recommended for day-to-day use. Let's take a look at some low-level (plumbing) Git commands that achieve the same result, to better understand the inner workings of Git.
It might seem like a Git repository is just a file system, but it's not. Your working directory is a file system, but the way Git stores your files and changes is very different from a regular file system. Is it a B-tree or an LSM tree, as in database systems? Not at all.
Let's initialize a repo:
git init -b main ./my-new-repo
This command creates an empty directory called my-new-repo and initializes a Git repository in it. Git creates a hidden folder called .git where all the version control data is stored. You continue working in the directory, and from time to time, you combine your changes and store them in the repository folder. So, .git is the crucial part. Let's take a closer look at its organization.
cd my-new-repo
tree .git
Output:
.git
├── HEAD
├── config
├── description
├── hooks
│ ├── ...
│ └── update.sample
├── info
│ └── exclude
├── objects
│ ├── info
│ └── pack
└── refs
    ├── heads
    └── tags
This is what an empty, freshly initialized repository looks like. The objects directory is where your content goes. Let's add a file:
echo "My version controlled file" > my-file
If you check the directory tree again as we did above, there are no changes yet.
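A quick way to confirm this without reading the tree by eye is sketched below. It runs in a throwaway repository created with mktemp (an assumption of this example, not part of the walkthrough) and relies on git count-objects, which reports loose objects, and git status.

```shell
# Work in a scratch repository so nothing real is touched (assumption of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q

echo "My version controlled file" > my-file

# Number of loose objects in .git/objects: still zero before staging
objects_before=$(git count-objects)
# The file is only known to the working directory, so it shows up as untracked
status_before=$(git status --porcelain)

echo "$objects_before"
echo "$status_before"
```

The file exists in the working directory, but nothing has been written to the object store yet.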
git add
To get the file into the object store, we need to stage it:
git add my-file
Let's check what happened to .git:
├── objects
│ ├── 0a
│ │ └── d2eaec86211bab157fb4ca0f2bcd099099b660
│ ├── info
│ └── pack
You should see the exact same object name if you add the same file content. Git stores the content of my-file in its object store (the .git/objects directory) as a blob (Binary Large Object).
The name 0ad2eaec86211bab157fb4ca0f2bcd099099b660 is the SHA-1 hash of the file content, prefixed with a short header that holds the object type and the content size. Git doesn't care about the file name when storing a blob. The file name only matters when the file is placed in your working directory. Also, note that Git uses the first two characters of the hash (0a) as a directory name and the remaining 38 characters as the file name of the object. This spreads objects across directories so that a large number of them can be stored efficiently.
Let's check the object type:
git cat-file -t 0ad2eaec
blob
We used the first few characters of the SHA1 hash in the above command.
What's in it:
git cat-file -p 0ad2eaec
My version controlled file
git write-tree
So, we've successfully stored a blob object in the store. However, a blob alone records neither the file name nor the directory layout, so it is not enough to track the changes. To store the file tree, you need to run:
git write-tree
e0aaef45a7b278c599211c06428d53d3d128749a
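You will get the exact same hash, because the content of a tree object (modes, names, and object hashes) is fully determined by what was staged, and identical content always hashes to the same value.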
Let's check the object store again:
├── objects
│ ├── 0a
│ │ └── d2eaec86211bab157fb4ca0f2bcd099099b660
│ ├── e0
│ │ └── aaef45a7b278c599211c06428d53d3d128749a
│ ├── info
│ └── pack
Now we have one more object in the object store.
Let's check what the object type is:
git cat-file -t e0aaef45
tree
(We used the first few characters of the SHA1 hash in the above command.)
What's in it:
git cat-file -p e0aaef45
100644 blob 0ad2eaec86211bab157fb4ca0f2bcd099099b660 my-file
It lists the content of the root directory, one record per entry. You can see that it refers to the SHA-1 of my-file's content, which we stored previously. The format is very simple. 100644 is the mode, which here marks a regular, non-executable file. blob is the type of the object the entry points to. Next is the hash, and the file name my-file goes at the end. A tree object only stores information for a single directory level, so it records blobs for the files in that directory and tree objects for the directories within it.
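The one-level-per-tree rule is easier to see with a nested directory. The sketch below uses a scratch repository and a hypothetical docs directory (neither is part of the walkthrough) and prints the root tree and then the nested tree.

```shell
# Scratch repository (assumption of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q

echo "My version controlled file" > my-file
mkdir docs                       # "docs" is a hypothetical directory name
echo "Nested file" > docs/notes

git add my-file docs/notes
root_tree=$(git write-tree)

# The root tree lists one tree entry (docs) and one blob entry (my-file)
root_listing=$(git cat-file -p "$root_tree")
echo "$root_listing"

# The nested tree, addressed as <tree>:<path>, lists only its own level
docs_listing=$(git cat-file -p "$root_tree:docs")
echo "$docs_listing"
```

The docs entry in the root listing has mode 040000 and type tree, and the file inside it only appears when the nested tree itself is printed.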
Okay, great! We've encountered two types of objects so far: blob and tree. Note that a tree object contains the hashes of its entries, so the hash of the tree depends on the hash of everything beneath it. That means that if you rename a file or change its content, a new tree object has to be generated.
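A small experiment illustrates the rename case. In this sketch (scratch repository and the name renamed-file are assumptions), the tree hash changes after a rename while the blob hash stays the same, because the blob never contained the file name.

```shell
# Scratch repository (assumption of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q

echo "My version controlled file" > my-file
git add my-file
tree_before=$(git write-tree)
blob_before=$(git rev-parse :my-file)        # blob hash recorded in the index

git mv my-file renamed-file                  # hypothetical new name
tree_after=$(git write-tree)
blob_after=$(git rev-parse :renamed-file)

echo "tree: $tree_before -> $tree_after"
echo "blob: $blob_before -> $blob_after"
```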
Also, note that if you store the same file content under two names, it is stored only once in the object store, because the content is the same and only the tree entries differ. But what if you change a single character of the file? Will it generate a new blob? Yes, it will. That sounds inefficient, right? Yes, that's correct too. However, Git compresses objects losslessly (with zlib) and can additionally store similar objects as deltas in packfiles, so it's very efficient in practice. Let's stay on track with the title of this article.
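Both claims can be checked by counting the files under .git/objects. This is a sketch in a scratch repository with hypothetical file names a.txt and b.txt.

```shell
# Scratch repository (assumption of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q

# Two names, identical content: only one blob is written
echo "same content" > a.txt
echo "same content" > b.txt
git add a.txt b.txt
count_before=$(find .git/objects -type f | wc -l | tr -d ' ')

# Change a single character in one file: a second, complete blob appears
echo "same content!" > b.txt
git add b.txt
count_after=$(find .git/objects -type f | wc -l | tr -d ' ')

echo "objects after staging two identical files: $count_before"
echo "objects after changing one character:      $count_after"
```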
git commit-tree
Now we have created a tree object for the root directory. Actually, a commit is just another object type stored in Git's object store. You might have already guessed that it holds a reference to the tree object at the root level of the directory. Let's create it:
git commit-tree e0aaef45a7b278c599211c06428d53d3d128749a -m "Adding version controlled file"
9e14e44d7dd72f8eeb12e4a6c8305e0cda7619af
This commit hash will be different for you, because the commit's content includes the author (who wrote the change), the committer (who put it into the repo), and the timestamps of both actions.
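To see that the identity and time fields are the only moving parts, the sketch below pins them through Git's GIT_AUTHOR_* and GIT_COMMITTER_* environment variables. The name and e-mail are placeholders, and the scratch repository is an assumption of this example.

```shell
# Scratch repository (assumption of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q
echo "My version controlled file" > my-file
git add my-file
tree=$(git write-tree)

# Pin every identity and time field (the name and e-mail are placeholders)
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"
export GIT_AUTHOR_DATE="1695832070 +0530" GIT_COMMITTER_DATE="1695832070 +0530"

# Same tree, message, identity, and time: same commit hash both times
first=$(git commit-tree "$tree" -m "Adding version controlled file")
second=$(git commit-tree "$tree" -m "Adding version controlled file")

# One second later on the committer side: a different commit hash
third=$(GIT_COMMITTER_DATE="1695832071 +0530" git commit-tree "$tree" -m "Adding version controlled file")

echo "$first"
echo "$second"
echo "$third"
```

The first two hashes match and the third differs, even though the tree and the message are identical in all three commits.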
So, there are now three objects in total:
├── objects
│ ├── 0a
│ │ └── d2eaec86211bab157fb4ca0f2bcd099099b660
│ ├── 9e
│ │ └── 14e44d7dd72f8eeb12e4a6c8305e0cda7619af
│ ├── e0
│ │ └── aaef45a7b278c599211c06428d53d3d128749a
│ ├── info
│ └── pack
Let's confirm its type:
git cat-file -t 9e14e44d7dd
commit
What's inside the commit object:
git cat-file -p 9e14e44d7dd
tree e0aaef45a7b278c599211c06428d53d3d128749a
author amila <amila.15@cse.mrt.ac.lk> 1695832070 +0530
committer amila <amila.15@cse.mrt.ac.lk> 1695832070 +0530

Adding version controlled file
We can depict all the objects we have in the object store as follows:
- Blob objects are represented using squares.
- Tree objects are represented using triangles.
- Commit objects are represented using circles.
git commit
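Now it's clear that git write-tree and git commit-tree together produce the commit object that git commit creates. git commit additionally moves the current branch to the new commit, which the low-level git update-ref command can do. Until then, the commit sits in the object store without any branch referring to it.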
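Put together, the whole sequence might look like the sketch below. It runs in a scratch repository with a placeholder identity (both are assumptions) and finishes with git update-ref, so that the current branch points at the new commit the way it would after git commit.

```shell
# Scratch repository and placeholder identity (assumptions of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

echo "My version controlled file" > my-file
git add my-file                                  # writes the blob, updates the index
tree=$(git write-tree)                           # writes the tree from the index
commit=$(git commit-tree "$tree" -m "Adding version controlled file")

# Point the current branch at the new commit
branch_ref=$(git symbolic-ref HEAD)              # for example refs/heads/main
git update-ref "$branch_ref" "$commit"

head_commit=$(git rev-parse HEAD)
git log --oneline
```

After the last step, git log and git status behave as if git commit had been run.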
Let's go one step further. Copy my-file into a directory called copied-files and then run the same commands as above (git add, git write-tree, and git commit-tree). The working directory now looks like this:
.
├── copied-files
│ └── my-file
└── my-file
Now the object store has five objects in total:
├── objects
│ ├── 0a
│ │ └── d2eaec86211bab157fb4ca0f2bcd099099b660
│ ├── 5a
│ │ └── 00e80086517be6481d428cc13c0d11ad3d3791
│ ├── 9e
│ │ └── 14e44d7dd72f8eeb12e4a6c8305e0cda7619af
│ ├── b6
│ │ └── e45dfc14712993a11554a154dc94bb8caa3cb3
│ ├── e0
│ │ └── aaef45a7b278c599211c06428d53d3d128749a
│ ├── info
│ └── pack
Then you'll have an object store with objects linked as shown below:
Notice that the second commit has a new root tree object, which points to the previous tree (now used as the content of the new copied-files directory) and to the blob object, reusing both. This entire structure is a Directed Acyclic Graph (DAG), which means you can arrange all the commit, tree, and blob objects in a line so that every reference points in the same direction. Starting from a commit and traversing its sub-graph, you can recreate the working directory file tree for that snapshot. If you start from the second commit, you will reach the same blob twice, since the file is duplicated.
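The reuse can be verified from the command line. The sketch below repeats both commits in a scratch repository with a placeholder identity. Passing the first commit as the parent with -p and the second commit message are assumptions of this example.

```shell
# Scratch repository and placeholder identity (assumptions of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

echo "My version controlled file" > my-file
git add my-file
first_tree=$(git write-tree)
first_commit=$(git commit-tree "$first_tree" -m "Adding version controlled file")

mkdir copied-files
cp my-file copied-files/my-file
git add copied-files/my-file
second_tree=$(git write-tree)
# -p records the first commit as the parent (assumption of this sketch)
second_commit=$(git commit-tree "$second_tree" -p "$first_commit" -m "Copying the file")

# The copied-files tree inside the second commit is the first root tree
subdir_tree=$(git rev-parse "$second_commit:copied-files")
# Both paths resolve to one and the same blob
root_blob=$(git rev-parse "$second_commit:my-file")
copied_blob=$(git rev-parse "$second_commit:copied-files/my-file")
object_count=$(find .git/objects -type f | wc -l | tr -d ' ')

echo "first root tree:   $first_tree"
echo "copied-files tree: $subdir_tree"
echo "objects in store:  $object_count"
```

Only two new objects were written for the second commit: the new root tree and the commit itself.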
In fact, a Git branch or a lightweight tag is just a pointer to a commit object (an annotated tag adds a small tag object in between). So creating a new branch is a very inexpensive operation.
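A sketch of what that pointer looks like on disk, assuming Git's default file-based ref storage, a scratch repository, a placeholder identity, and a hypothetical branch name feature:

```shell
# Scratch repository and placeholder identity (assumptions of this sketch)
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

echo "My version controlled file" > my-file
git add my-file
git commit -q -m "Adding version controlled file"

git branch feature                            # "feature" is a hypothetical branch name

# With the default ref storage, the new branch is a small text file
# that contains nothing but a commit hash
branch_file=$(cat .git/refs/heads/feature)
head_commit=$(git rev-parse HEAD)

echo "refs/heads/feature: $branch_file"
echo "HEAD commit:        $head_commit"
```

No objects are copied when the branch is created. Git only writes the 40-character hash of the current commit.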
Nice, now Git commit doesn't have to be something that keeps you awake at night!
Happy Learning!