Introduction To Git

I'm still writing this and will keep writing for a few years as this will be my go-to reference for all problems I faced with Git :)

History of Git

To understand a technology, it is important to first understand the historical context that led to its creation. The fundamental problem that Git and similar tools try to solve is how to manage different versions of files, which is called Version Control.

Version Control

People who are starting to learn programming and are not familiar with version control often copy their files into different folders as a way of keeping different versions of the same thing. This process is error-prone and inefficient: you copy the whole file every time, when you could save just the difference between two versions and still be able to reconstruct any given version. When you start collaborating with others there are more problems. You either share the same file through a shared folder on the network or constantly send files to each other. Both solutions suck, and neither allows two or more people to edit the same file without conflicts.

The initial solution to this problem was the Centralized Version Control System. These systems have a central server that holds all the versions, and clients check out files from the server. Instead of figuring out a way to allow two or more people to work on the same file, the solution adopted was to lock the file. A developer would lock the files they were going to edit, and nobody else could send a new version of those files until the lock was released.

I had the pleasure of working with such a system for a while and it sucks. I had to send emails to people because someone had locked a file that I needed to edit and then left on vacation. This system also had the problem of being centralized: if there was any problem with the server, nobody could retrieve files or send new versions.

How Linux Did Version Control

Linux is an excellent example because it is one of the biggest open source projects, with contributors all over the world. Git was created because of Linux, so it's worth knowing how the hell they managed version control before Git.

Diff, Patch, and Tarball

For the first 10 years the Kernel used tarballs and patches. What is that? I said before that copying files is inefficient. Imagine if they had to send the entire codebase over email or FTP for every change. It would be a nightmare. Before we get to the solution, it is worth knowing three tools: diff, patch, and tar. diff shows the difference between two files:

# printf interprets \n in every shell; plain echo does not in bash
printf "Hello\nMy name is Linus\nI'm a developer\nI've created Linux and Git\nHow are you doing?\na\nb\nc\nd\ne\n" > hello_v1
printf "Hello\nMy name is Marcelo\nI'm a developer\nI've created absolutely nothing\nHow are you doing?\na\nb\nc\nd\ne\n" > hello_v2

diff -u hello_v1 hello_v2

Which provides the output:

--- hello_v1	2021-12-13 21:01:02.521530015 -0300
+++ hello_v2	2021-12-13 21:01:08.021516740 -0300
@@ -1,7 +1,7 @@
 Hello
-My name is Linus
+My name is Marcelo
 I'm a developer
-I've created Linux and Git
+I've created absolutely nothing
 How are you doing?
 a
 b

Notice that the output appears to contain the whole file, but the last three lines (c, d, and e) are not present. That is because diff only emits what changed, plus a few unchanged lines around each change (three by default) so that the change can be located in context. It almost goes without saying, but this tool is very fast. It is used together with patch to actually apply the change:

diff -u hello_v1 hello_v2 > my_patch.patch
patch < my_patch.patch

Then if we look at hello_v1 we have:

Hello
My name is Marcelo
I'm a developer
I've created absolutely nothing
How are you doing?
a
b
c
d
e

Finally, tar saves many files together into a single tape or disk archive, and can restore individual files from the archive. Usage:

tar -cf archive.tar file1 file2
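
To see the "restore individual files" part in action, here is a small sketch. The scratch directory and the file contents are made up for the example:

```shell
# Work in a scratch directory so nothing real gets overwritten.
cd "$(mktemp -d)"
echo "one" > file1
echo "two" > file2

tar -cf archive.tar file1 file2   # create the archive
rm file1 file2                    # pretend the originals were lost
listing=$(tar -tf archive.tar)    # list what is inside the archive
tar -xf archive.tar file1         # restore only file1

echo "$listing"
ls
```

After this, file1 is back on disk while file2 only exists inside the archive.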

Linus would put a version of the Linux Kernel in tar format somewhere public where people could download it. Developers would download this tarball, develop their features, run a diff over the whole tree, and send the diff as an attachment to Linus. Linus would then review it, apply the diff using patch, and, if he was happy with the code, create a new tarball with an increased version number and publish it.
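
The round trip described above can be sketched roughly like this. It is only an illustration: the project name, the version numbers, and the one-file "codebase" are all invented, and the real kernel workflow involved email and a lot more review:

```shell
work=$(mktemp -d)
cd "$work"

# Maintainer: publish version 1.0 as a tarball (names are made up).
mkdir project-1.0
printf 'int main(void) { return 1; }\n' > project-1.0/main.c
tar -cf project-1.0.tar project-1.0

# Contributor: unpack, keep a pristine copy, and edit a second copy.
mkdir contributor
tar -xf project-1.0.tar -C contributor
cp -r contributor/project-1.0 contributor/project-1.0-mine
printf 'int main(void) { return 0; }\n' > contributor/project-1.0-mine/main.c

# diff exits with status 1 when the trees differ, hence the "|| true".
(cd contributor && diff -ruN project-1.0 project-1.0-mine > ../fix.patch) || true

# Maintainer: apply the patch to the original tree and publish a new tarball.
# -p1 strips the leading directory name from the paths in the patch.
(cd project-1.0 && patch -p1 < ../fix.patch)
mv project-1.0 project-1.1
tar -cf project-1.1.tar project-1.1
```

The only thing that travels from the contributor to the maintainer is fix.patch, which is tiny compared to the full tree.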

BitKeeper

"A solution for growing pains" is a very interesting email that describes how the process above didn't scale. Eventually Linus decided to use the proprietary software BitKeeper. The project had a license for the community version, which allowed the software to be used for open source projects provided developers did not participate in the development of a competing tool. This was around 2002. Many people in the community were unhappy about using a proprietary tool. Richard Stallman wrote in an e-mail:

The spirit of the Bitkeeper license is the spirit of the whip hand. It is the spirit that says, "You have no right to use Bitkeeper, only temporary privileges that we can revoke. Be grateful that we allow you to use Bitkeeper. Be grateful, and don't do anything we dislike, or we may revoke those privileges." It is the spirit of proprietary software. Every non-free license is designed to control the users more or less. Outrage at this spirit is the reason for the free software movement. (By contrast, the open source movement prefers to play down this same outrage.) If the latest outrage brings the spirit of the non-free Bitkeeper license into clear view, perhaps that will be enough to convince the developers of Linux to stop using Bitkeeper for Linux development.

In 2005 BitMover (the company behind BitKeeper) announced that it would terminate the free license. The reason given was that Andrew Tridgell, author of Samba and co-inventor of rsync, was working on a client that would show the metadata (data about revisions), which was only available under commercial licenses.

Linus decides to write his own version control

After releasing kernel 2.6, Linus wanted to solve the version control problem once and for all. He looked at other version control systems but couldn't find one he was happy with. His requirements:

  • must be distributed
  • must not have performance problems
  • must guarantee that whatever gets added is retrieved exactly as it was added

And with that we have our historical context.

Git

Now let's take a look at Git in general. These concepts are valuable.

Snapshots

Other Version Control Systems think of versions in terms of what changed in each one:

Deltas in VCS (Source: Git Book)


Git, on the other hand, thinks in terms of snapshots:

Snapshots in Git (Source: Git Book)

Integrity

Git checksums everything before storing it and uses the resulting checksum as the reference to that content. A commit ID is the SHA-1 hash of the content of the commit, and it looks something like this: 973a3da3b6f158087371ea5b37fcab79c9af3b09.

This ensures data integrity when writing and reading. If the underlying file gets corrupted, the SHA-1 won't match and Git will warn you about it. A side effect of this approach is that it also makes Git secure, but as Linus himself said, this was never the main goal. The main goal is to not have to worry about data integrity.
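
To make the checksum idea concrete, here is a small sketch that computes the ID of a file's content twice: once with git hash-object and once by hand. It assumes sha1sum is available (on macOS you would probably need shasum instead) and that Git is using its default SHA-1 object format. For file contents, Git hashes a short header (the word blob, the size in bytes, and a NUL byte) followed by the content itself:

```shell
content='hello'

# Let Git compute the object ID of "hello\n" (6 bytes including the newline).
with_git=$(printf '%s\n' "$content" | git hash-object --stdin)

# Rebuild the same ID by hand: header "blob 6", a NUL byte, then the content.
by_hand=$(printf 'blob 6\0%s\n' "$content" | sha1sum | cut -d ' ' -f 1)

echo "git:     $with_git"
echo "by hand: $by_hand"
```

Both lines print the same 40-character hash. Change a single byte of the content and the ID changes too, which is how corruption gets noticed.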

Configuration

The configuration can be stored in three places:

  • /etc/gitconfig: settings for every user of the system. When using git config, pass the --system option to read or write this file.
  • ~/.gitconfig or ~/.config/git/config: user wide settings, use the --global option.
  • .git/config: local for the specific repository, use the --local option which is the default.

Each level overrides values from the previous level. There are several options; these are the ones I use:

[user]
name = Marcelo Fernandes
email = marcelo.schreiber@gmail.com
[init]
defaultBranch = main
[core]
editor = vim
[commit]
verbose = true # shows a diff when doing a git commit, I find it useful because it forces me to do some craftsmanship on my commits
[sequence]
editor = rebase-editor
[merge]
tool = meld
[help]
autocorrect = 1 # if I type something wrong like git checout it will execute the command git checkout
[alias]
co = checkout
br = branch
ci = commit
st = status
unstage = reset HEAD --
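
If you want to see the override order for yourself, this sketch sets the same key at two levels and asks Git which value wins. To avoid touching real settings it points HOME at a throwaway directory and skips the system file. The directory layout and the choice of nano as the local value are just for the demo:

```shell
# Use a throwaway HOME and repository so no real configuration is modified.
demo=$(mktemp -d)
export HOME="$demo/home"
export XDG_CONFIG_HOME="$demo/home/.config"
export GIT_CONFIG_NOSYSTEM=1            # ignore /etc/gitconfig for this demo
mkdir -p "$HOME"
git init -q "$demo/repo"
cd "$demo/repo"

git config --global core.editor vim     # written to ~/.gitconfig
global_view=$(git config core.editor)   # only the global value exists so far

git config --local core.editor nano     # written to .git/config
effective=$(git config core.editor)     # the repository value wins

echo "global only: $global_view, with local override: $effective"

# Show every value for the key together with the file it comes from.
git config --show-origin --get-all core.editor
```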

.gitignore

  • Can start patterns with / to anchor them to the directory of the .gitignore file, so they don't match recursively
  • Can end patterns with / to specify a directory
  • Can negate a pattern by starting it with !
  • * matches zero or more characters
  • ** matches nested directories
  • [abc] matches any character inside the brackets
  • ? matches a single character
  • [0-9] matches characters between 0 and 9

It's possible to have a .gitignore in subdirectories; its rules apply only to that directory and everything below it.
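
A quick way to check how these rules behave is to build a scratch repository and ask Git which files it ignores. The file names below are invented for the example:

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir -p build docs
touch TODO docs/TODO debug.log important.log build/out.o

cat > .gitignore <<'EOF'
/TODO
build/
*.log
!important.log
EOF

# Untracked files that match an ignore rule.
ignored=$(git ls-files --others --ignored --exclude-standard)
# Untracked files that Git would still offer to add.
visible=$(git ls-files --others --exclude-standard)

echo "ignored:"; echo "$ignored"
echo "visible:"; echo "$visible"
```

TODO at the top level is ignored but docs/TODO is not, because the leading / anchors the pattern. important.log survives thanks to the ! rule even though *.log matches it.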

Commands

Adding files

  • git add -p file1: choose which portions of the file are going to be committed, useful when you don't want to commit all the changes you made to the file
  • git add -u: stages modifications and deletions of files that are already tracked. Useful when you want to add all your modifications but there is an untracked file lying around: git add . would also add the untracked file, git add -u does not
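
Here is a minimal sketch of the git add -u case, using a scratch repository with one tracked file and one untracked file (both names are made up):

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked.txt"

echo "v2" > tracked.txt     # modify a file Git already knows about
echo "scratch" > notes.txt  # create a file Git has never seen

git add -u                  # stages tracked.txt only
status=$(git status --porcelain)
echo "$status"
```

In the porcelain output, tracked.txt shows up as staged (M in the first column) while notes.txt is still listed as untracked (??).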

Removing files

Instead of deleting the file by hand I prefer to use git rm README, which deletes the file and stages the removal. If I want to stop tracking the file but keep it on disk, I use the --cached option.
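
A small sketch of the --cached variant, assuming a file that was committed by mistake (local.env is a made-up name):

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"

echo "KEY=value" > local.env
git add local.env
git commit -q -m "Add local.env by mistake"

git rm -q --cached local.env   # remove from the index, keep the file on disk
status=$(git status --porcelain)
echo "$status"
```

The file is still in the working directory, the deletion is staged (D), and Git now sees the file on disk as untracked (??).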

Moving files

This is a life saver when you want to change the casing of a file name: git mv README readme renames README to readme.

Viewing history

The basic usage is git log, which lists the entire history in reverse chronological order (newest first). Some interesting options:

  • git log -2: lists the last two commits
  • git log -p: lists the diff of each commit
  • git log --stat: shows the list of files that were changed as well as the number of lines that were added or removed (I'm a fan of this one)
  • git log --pretty=oneline: allows different visualization modes. You can replace oneline with short, full, and fuller
  • git log --since="10 days ago": accepts a variety of formats, it's so good that I don't have to consult anything and it just works. Can also use --after
  • git log --until="1 year ago": same as above, can also use --before
  • git log -S function_name: searches for commits that added or removed occurrences of the string function_name
  • git log -- path/to/file: only commits that modified the file or directory
  • git log --graph: adds an ASCII graph showing branches and the merge history
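
To play with a few of these options without a real project, the sketch below builds a three-commit history in a scratch repository. The file names and commit messages are invented, and --pretty=format:%s (print only the subject line) is used instead of oneline to keep the output free of hashes:

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"

echo "intro" > README
git add README && git commit -q -m "Add README"
echo "greet() { echo hi; }" > lib.sh
git add lib.sh && git commit -q -m "Add greet function"
echo "more docs" >> README
git add README && git commit -q -m "Expand README"

last_two=$(git log -2 --pretty=format:%s)            # two newest commits
pickaxe=$(git log -S greet --pretty=format:%s)       # commits adding/removing "greet"
readme_only=$(git log --pretty=format:%s -- README)  # commits touching README

echo "last two:";    echo "$last_two"
echo "-S greet:";    echo "$pickaxe"
echo "README only:"; echo "$readme_only"
```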

Doing Stuff

Changing the last commit

git commit --amend takes whatever is in the staging area and adds it to the previous commit. If you run it right after the previous commit, with nothing staged, then the only thing that changes is the commit message, if you choose to edit it. Either way the previous commit is replaced by a new one with a different SHA-1.
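
A minimal sketch of the "forgot to add a file" case in a scratch repository. File names and the commit message are made up, and --no-edit keeps the existing message so no editor opens:

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"

echo "a" > a.txt
git add a.txt
git commit -q -m "Add a and b"            # oops, b.txt was forgotten
before=$(git rev-parse HEAD)

echo "b" > b.txt
git add b.txt
git commit -q --amend --no-edit           # fold b.txt into the previous commit
after=$(git rev-parse HEAD)

count=$(git rev-list --count HEAD)        # still a single commit
files=$(git ls-tree --name-only HEAD)     # files recorded in that commit
echo "commits: $count"
echo "$files"
```

There is still only one commit, it now contains both files, and its hash differs from the original one.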

See diff of a specific commit

git show <commit-sha> shows the changes introduced by the commit. (git diff <commit-sha> compares that commit with your current working tree instead.)

Reset branch to be exactly the same as origin

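git reset --hard origin/<branch-name>

This throws away local commits and uncommitted changes to tracked files, and origin/<branch-name> is only as fresh as your last git fetch.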
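
To try this safely, the sketch below fakes a remote with a second local repository. The upstream and clone directory names are assumptions for the demo, and the branch name is read from the clone instead of being hard-coded, since the default differs between Git versions:

```shell
work=$(mktemp -d)
cd "$work"

# A local repository standing in for the remote.
git init -q upstream
cd upstream
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "Initial commit"
cd ..

git clone -q upstream clone
cd clone
git config user.name "Example User"
git config user.email "user@example.com"
branch=$(git rev-parse --abbrev-ref HEAD)  # whatever the default branch is called

# Make a local commit that we later decide to throw away.
echo "local experiment" > file.txt
git commit -q -am "Local commit that will be discarded"

git fetch -q origin                        # refresh origin/<branch> first
git reset -q --hard "origin/$branch"       # branch and files now match origin
cat file.txt
```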

External Tools

Rebase Editor

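rebase-editor is a terminal-based editor for the todo list of an interactive rebase (git rebase -i). It is what the sequence.editor setting in my configuration points to.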

Meld

Meld is my go-to visual merge tool. Whenever I have a conflict that I need to see visually to understand what is going on, I just type: git mergetool.

References