What the heck is git doing?

A short stroll through what git is doing behind the scenes.

First: this article is not a detailed look at git. It's a very light introduction to what git is doing behind the scenes. Check out the bibliography/links section at the bottom of the article for more detailed information!

Let's begin -- git is basically doing 4 things:

creating hashes of files
compressing files
storing files
generating various types of maps to these files
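
Here's a rough sketch of those four jobs in Python. It is not git's actual code: the names 'hash_blob' and 'store_blob' are made up for this article, and the map at the end is a plain dict standing in for git's real (binary) index. The "blob <size>" header followed by a NUL byte, the SHA-1 hash, and the zlib compression are how git handles loose file objects by default.

```python
import hashlib
import os
import tempfile
import zlib


def hash_blob(content: bytes) -> str:
    """Hash file content the way git does for a file ("blob") object."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def store_blob(objects_dir: str, content: bytes) -> str:
    """Compress content and store it under <first 2 hex chars>/<other 38>."""
    digest = hash_blob(content)
    header = b"blob " + str(len(content)).encode() + b"\0"
    path = os.path.join(objects_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(header + content))
    return digest


objects_dir = tempfile.mkdtemp()             # stand-in for .git/objects/
digest = store_blob(objects_dir, b"hello world\n")
file_map = {"hello.txt": digest}             # a map from filename to hash
print(file_map)
```

As a sanity check, hash_blob(b"") returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the same ID git reports for an empty file.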

Knowing this, let's walk through what some of the basic commands are doing behind the scenes!

git init / git clone

'git init' just creates a .git/ folder. There are plenty of files in there, but the two that matter for this article are the index file (which git writes the first time you stage something) and the objects/ folder.

The index is just a map of filenames to each file's computed hash, and the objects/ folder contains every file ever added to the repository (and by "every file," I mean every version of every file).

'git clone' creates this directory, grabs everything in the remote's .git/objects/ folder along with its branches, and then builds a fresh .git/index file as it checks out your working copy (plus all the other files we're not talking about).
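
To make the folder layout concrete, here's a toy version of both commands. Everything in it is an assumption for illustration: 'toy_init' and 'toy_clone' are made-up names, the layout is a small subset of what real git creates, and the "clone" is just a local folder copy with no network involved.

```python
import os
import shutil
import tempfile


def toy_init(project_dir: str) -> str:
    """Create a bare-bones .git/ layout (a subset of what real git creates)."""
    git_dir = os.path.join(project_dir, ".git")
    os.makedirs(os.path.join(git_dir, "objects"))        # stored file versions
    os.makedirs(os.path.join(git_dir, "refs", "heads"))  # branch pointers
    return git_dir


def toy_clone(source_git_dir: str, target_project_dir: str) -> str:
    """Copy the stored objects and branch pointers from another repository."""
    git_dir = os.path.join(target_project_dir, ".git")
    for name in ("objects", "refs"):
        shutil.copytree(os.path.join(source_git_dir, name),
                        os.path.join(git_dir, name))
    return git_dir


origin = toy_init(tempfile.mkdtemp())
# Pretend one object was already stored in the original repository.
os.makedirs(os.path.join(origin, "objects", "ab"))
with open(os.path.join(origin, "objects", "ab", "cdef"), "wb") as f:
    f.write(b"compressed bytes would go here")

copy = toy_clone(origin, tempfile.mkdtemp())
print(os.path.exists(os.path.join(copy, "objects", "ab", "cdef")))
```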

git add

Once files in your working directory (everything that is not the .git/ folder in the directory that contains the .git/ folder) have been changed, you may run the 'git add' command. This command does 3 primary things:

hashes changed files
adds entries to the .git/index file
compresses and stores those files in the .git/objects/ directory

This is called STAGING.
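
The three steps can be sketched like this. It's a simplified model rather than git's implementation: 'toy_add' is a made-up name, the index is a plain dict instead of git's binary index file, and the folder paths are temporary stand-ins.

```python
import hashlib
import os
import tempfile
import zlib


def toy_add(work_dir: str, objects_dir: str, index: dict, filename: str) -> None:
    """Stage one file: hash it, record it in the index, store a compressed copy."""
    with open(os.path.join(work_dir, filename), "rb") as f:
        content = f.read()
    data = b"blob " + str(len(content)).encode() + b"\0" + content
    digest = hashlib.sha1(data).hexdigest()          # 1. hash the file
    index[filename] = digest                         # 2. add an index entry
    path = os.path.join(objects_dir, digest[:2], digest[2:])
    if not os.path.exists(path):                     # identical content is stored once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(data))             # 3. compress and store


work_dir = tempfile.mkdtemp()
objects_dir = os.path.join(work_dir, ".git", "objects")
index = {}

with open(os.path.join(work_dir, "notes.txt"), "w") as f:
    f.write("first draft\n")
toy_add(work_dir, objects_dir, index, "notes.txt")

with open(os.path.join(work_dir, "notes.txt"), "w") as f:
    f.write("second draft\n")
toy_add(work_dir, objects_dir, index, "notes.txt")

print(index)   # one entry, pointing at the latest staged version
```

Staging the file twice leaves one index entry but two stored objects, which is the "every version of every file" idea in miniature.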

At this point git has copies of the files being versioned, but it doesn't really know their place in the repository's history, which is where 'git commit' comes into play. However, what if you mess up and stage a file that you don't want staged? 'git reset' to the rescue!

git reset

If you think about what 'git add' is doing, 'git reset' does (roughly) the opposite! For files that have been staged but not yet committed (so they aren't stamped into git history), it puts their .git/index entries back to whatever the last commit recorded. The copies in the .git/objects/ folder aren't deleted on the spot; nothing points to them any more, so git is free to clean them up later.
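
Here's a small sketch of the index side of that. The name 'toy_reset' and the shortened hashes are placeholders invented for the example, and the index is again modelled as a dict. The stored object copies are simply left alone in this sketch, on the assumption that cleaning up unreferenced objects is a separate, later step.

```python
def toy_reset(index: dict, last_commit_files: dict, filename: str) -> None:
    """Unstage one file by putting its index entry back to the committed state."""
    if filename in last_commit_files:
        index[filename] = last_commit_files[filename]   # back to the committed hash
    else:
        index.pop(filename, None)                       # file was new: drop the entry


# Hypothetical state: the hashes are shortened placeholders, not real object IDs.
last_commit_files = {"readme.txt": "aaa111"}
index = {"readme.txt": "bbb222", "oops.txt": "ccc333"}  # both staged with 'git add'

toy_reset(index, last_commit_files, "readme.txt")
toy_reset(index, last_commit_files, "oops.txt")
print(index)   # only readme.txt remains, at its committed hash
```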

git commit -m "kewl"

'git commit' takes the files with newer hashes, plus all the other hashes from the latest commit, and creates a "tree graph" object from this (git itself just calls it a "tree") -- which is a file that contains a list of names, each paired with a file hash or another tree graph's hash. In this way, it mirrors the directory structure of your working directory, but instead of storing file contents, it stores hashes of those files. This lets a tree graph object describe the state of a repository at a particular version without having to store duplicates of any files.
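
A sketch of how a flat index could be turned into nested tree graph objects. The function names are made up, the file hashes are placeholders, and every file is assumed to be a regular non-executable file (mode 100644). The entry layout (mode, name, NUL byte, raw 20-byte hash) follows git's tree format as I understand it.

```python
import hashlib


def hash_object(kind: bytes, body: bytes) -> str:
    """Hash an object the git way: '<kind> <size>' + NUL + body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(index: dict, trees: dict) -> str:
    """Turn a flat {path: file hash} map into nested trees; return the root hash."""
    files, subdirs = {}, {}
    for path, digest in index.items():
        name, _, rest = path.partition("/")
        if rest:
            subdirs.setdefault(name, {})[rest] = digest   # belongs to a subfolder
        else:
            files[name] = digest
    entries = [(name, b"100644", digest) for name, digest in files.items()]
    # Folders sort as if their name ended in "/", and point at another tree.
    entries += [(name + "/", b"40000", write_tree(sub, trees))
                for name, sub in subdirs.items()]
    body = b""
    for name, mode, digest in sorted(entries):
        body += mode + b" " + name.rstrip("/").encode() + b"\0" + bytes.fromhex(digest)
    tree_hash = hash_object(b"tree", body)
    trees[tree_hash] = body                               # toy object store
    return tree_hash


trees = {}
# "aa" * 20 and friends are placeholder file hashes, not real object IDs.
v1 = write_tree({"readme.txt": "aa" * 20, "src/main.py": "bb" * 20}, trees)
v2 = write_tree({"readme.txt": "cc" * 20, "src/main.py": "bb" * 20}, trees)
print(v1 != v2)     # the root tree changed along with readme.txt
print(len(trees))   # both versions share the single unchanged src/ tree
```

Changing one file produces a new root tree, but the untouched src/ folder hashes to the same tree both times, so it is stored only once.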

After it generates the tree graph, 'git commit' creates a second file called the "commit object." This object stores a reference to the tree graph, a reference to the previous commit (its "parent," if there was one), and some other information about the commit (like the author and the commit message).
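
Roughly, a commit object is a few lines of text that get hashed like everything else. The sketch below assumes a fixed +0000 timezone, uses the same person as author and committer, and uses an invented author ("Ada Example"); 'make_commit' is a made-up helper, not a git API.

```python
import hashlib


def make_commit(tree: str, parent, author: str, timestamp: int, message: str):
    """Build the text of a commit object and return (hash, text)."""
    lines = ["tree " + tree]
    if parent is not None:                   # the very first commit has no parent
        lines.append("parent " + parent)
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest(), body.decode()


empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # git's ID for an empty tree
who = "Ada Example <ada@example.com>"                      # hypothetical author

first, first_text = make_commit(empty_tree, None, who, 0, "kewl")
second, second_text = make_commit(empty_tree, first, who, 60, "kewl again")
print(second_text)
```

Because the parent's hash is part of the text being hashed, every commit ID depends on the entire history before it.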

At this point, the history can be updated -- so refs (basically "branches") are updated to point to the most current commit object, which is the one just created.
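
In its simplest form, a branch is a tiny file holding one commit hash. The sketch below assumes refs are stored as individual files under refs/heads/ (git can also store them in other ways); the helper names and the shortened hashes are made up.

```python
import os
import tempfile


def update_ref(git_dir: str, branch: str, commit_hash: str) -> None:
    """Point a branch at a commit by writing the hash into its ref file."""
    path = os.path.join(git_dir, "refs", "heads", branch)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(commit_hash + "\n")


def read_ref(git_dir: str, branch: str) -> str:
    """Return the commit hash a branch currently points to."""
    with open(os.path.join(git_dir, "refs", "heads", branch)) as f:
        return f.read().strip()


git_dir = os.path.join(tempfile.mkdtemp(), ".git")
update_ref(git_dir, "main", "1111111")   # placeholder hash of the first commit
update_ref(git_dir, "main", "2222222")   # a new commit moves the branch forward
print(read_ref(git_dir, "main"))
```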

git pull [--rebase]

Using the knowledge gained from the previous sections, let's examine how the 'git pull' command works!

'git pull' by itself will go out to the remote repository and fetch all the objects you don't have yet (hashed files, tree graphs, and commit objects), placing them "on top" of any new commits found locally. That requires merging them with the working copy, staging, and then committing. The 'git pull' command does all this automatically, unless it runs into merge issues. This is why you'll often get a prompt of some sort to modify or add a commit message for the merge!

Using the 'rebase' switch (e.g. 'git pull --rebase') goes through a similar process, but instead of placing remote commits "on top" of your local commits, it inserts the new remote commits before any local commits that are not on the remote server, then re-applies those local commits on top of the remote ones.

That's a mouthful, but the end result is that the rebase modifies local history to match the remote repository, and only then applies local changes to the repository.

This tends to reduce "noise" in the commit history by removing unneeded commit objects/tree graphs describing a merge.
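
Here's a toy model of the two outcomes. Commits are reduced to a message plus a list of parents, with a short made-up ID derived from both. Real git re-applies the actual file changes during a rebase; this sketch only copies the messages, and all the function names are invented.

```python
import hashlib

commits = {}   # toy object store: commit ID -> (list of parent IDs, message)


def commit(parents, message):
    """Record a commit; its ID depends on its parents, like a real hash does."""
    text = "|".join(parents) + ":" + message
    cid = hashlib.sha1(text.encode()).hexdigest()[:7]
    commits[cid] = (parents, message)
    return cid


def reachable(tip):
    """Every commit you can get to by following parents from tip."""
    seen, stack = set(), [tip]
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(commits[cid][0])
    return seen


def pull_merge(local_tip, remote_tip):
    """Plain pull: one extra commit with two parents ties the histories together."""
    return commit([local_tip, remote_tip], "Merge remote changes")


def pull_rebase(local_only, remote_tip):
    """Rebase pull: replay each local-only commit, in order, on the remote tip."""
    tip = remote_tip
    for cid in local_only:
        tip = commit([tip], commits[cid][1])   # same message, new parent, new ID
    return tip


base = commit([], "base")
remote = commit([base], "remote work")
local = commit([base], "local work")

merged = pull_merge(local, remote)
rebased = pull_rebase([local], remote)
print(len(reachable(merged)))    # 4: base, both pieces of work, and the merge commit
print(len(reachable(rebased)))   # 3: base, remote work, replayed local work
print(rebased != local)          # True: the replayed commit gets a new ID
```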

git push

A 'git push' is used to synchronise remote state with new state created locally. It copies the tree graph, commit, and file objects that the remote repository doesn't have yet, then updates the remote branch to point at your latest commit.

Git prefers to do the hard work (like merging) locally. As such, 'git push' will error if the remote branch contains commits that the local repository does not contain; you have to pull them down first.
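
That check can be sketched as "is the remote's latest commit somewhere in my history?". This is a toy model under stated assumptions: each repository is a dict mapping a commit ID to its parent IDs, the single-letter IDs are placeholders, and 'toy_push' is a made-up name rather than anything git exposes.

```python
def is_ancestor(commits: dict, ancestor: str, tip: str) -> bool:
    """True if `ancestor` can be reached by following parents from `tip`."""
    stack, seen = [tip], set()
    while stack:
        cid = stack.pop()
        if cid == ancestor:
            return True
        if cid not in seen:
            seen.add(cid)
            stack.extend(commits.get(cid, []))
    return False


def toy_push(local_commits: dict, local_tip: str, remote_commits: dict, remote_tip):
    """Return the new remote tip, or raise if the remote has commits we lack."""
    if remote_tip is not None and not is_ancestor(local_commits, remote_tip, local_tip):
        raise RuntimeError("rejected: pull the remote changes first")
    for cid, parents in local_commits.items():
        remote_commits.setdefault(cid, parents)   # copy only what is missing
    return local_tip


remote = {"a": []}                       # the remote branch points at "a"
mine = {"a": [], "b": ["a"]}             # I committed "b" on top of "a"
remote_tip = toy_push(mine, "b", remote, "a")
print(remote_tip)                        # the remote branch now points at "b"

theirs = {"a": [], "c": ["a"]}           # a teammate who never pulled "b"
try:
    toy_push(theirs, "c", remote, remote_tip)
except RuntimeError as err:
    print(err)
```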

The end.

There's a LOT more that git does, but what I've described above is a simple model of what git is doing behind the scenes. This sort of mental model can help greatly when trying to figure out more complex operations in git!

Here are some links to read and view for developing a more detailed understanding of git:

The Git Book: http://www.git-scm.com/book/en/v2
Git from the inside out: https://codewords.recurse.com/issues/two/git-from-the-inside-out
Git from the bottom up: https://jwiegley.github.io/git-from-the-bottom-up/
Advanced Git: https://www.youtube.com/watch?v=8ET_gl1qAZ0
Git Real: https://www.codeschool.com/courses/git-real
Git Real-2: https://www.codeschool.com/courses/git-real-2