Git Architecture and Objects

The Hook (The "Byte-Sized" Intro)

Every file you commit, every folder structure, every snapshot: Git doesn't store them as ordinary files. It stores them as objects in a content-addressable database, each identified by a 40-character fingerprint. It's like a library where every book has a barcode generated from its contents. Change one word, and the barcode changes. That's how Git makes any tampering with history detectable.

📖 What is Git Architecture and Objects?

Under the hood, Git is a content-addressable filesystem: a simple but powerful key-value store in which the key is a hash of the value. It stores all data as four types of objects, each identified by a SHA-1 hash (the default; newer Git versions can also create SHA-256 repositories). Understanding these objects makes Git's behavior predictable and debuggable.
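To make the key-value idea concrete, here is a minimal sketch that stores a value in a throwaway repository and reads it back by key. It assumes Git and a POSIX shell are available; the temporary directory and the variable names are illustrative only.

```bash
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# "Put": write content into the object database. Git prints the key (object ID).
key=$(printf 'hello\n' | git hash-object -w --stdin)
echo "key:   $key"

# "Get": read the value back using only the key.
value=$(git cat-file -p "$key")
echo "value: $value"

# Hashing the same content again (without writing) yields the same key.
key_again=$(printf 'hello\n' | git hash-object --stdin)
echo "again: $key_again"
```

The key is derived entirely from the content, which is why nobody has to assign or coordinate IDs.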

Conceptual Clarity

Git has exactly 4 types of objects:

Object	What It Stores	Analogy
Blob	Raw file contents (no filename, no path)	A page of text
Tree	Directory listing — maps filenames to blobs and sub-trees	A table of contents
Commit	Snapshot — points to a tree + author + message + parent commit(s)	A dated, signed photo
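Tag	An annotated tag: a named pointer to another object (usually a commit) plus tagger, date, and message. Lightweight tags are plain refs, not objects	A sticky note on a photo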
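The four types in the table can be observed directly. The sketch below builds a tiny repository and asks Git for the type of one object of each kind. The file name, commit message, tag name, and identity are made up for the example.

```bash
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Identity is supplied through the environment so the sketch does not
# depend on any global Git configuration (values are placeholders).
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

echo '<h1>Home</h1>' > index.html
git add index.html
git commit -q -m "Add homepage"
git tag -a v1.0 -m "First release"   # -a creates a tag object, not just a ref

blob_type=$(git cat-file -t HEAD:index.html)   # the file's content
tree_type=$(git cat-file -t 'HEAD^{tree}')     # the root directory listing
commit_type=$(git cat-file -t HEAD)            # the snapshot
tag_type=$(git cat-file -t v1.0)               # the annotated tag

echo "$blob_type $tree_type $commit_type $tag_type"
```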

How they connect:

A commit points to a tree (the root folder) and to its parent commit(s)
A tree points to blobs (files) and other trees (subfolders)
Each object is identified by a SHA-1 hash: a 40-character hexadecimal fingerprint computed from a short header (the object's type and size) followed by its contents
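As a rough check of how an object ID is derived, the sketch below recomputes a blob's ID by hand and compares it with Git's answer. It assumes the GNU coreutils `sha1sum` tool is installed (on macOS, `shasum` plays the same role) and that the repository format is the default SHA-1.

```bash
#!/bin/sh
set -eu

# Git does not hash the raw bytes alone. It hashes "<type> <size>\0<content>".
# For the 6-byte content "hello\n" stored as a blob, the header is "blob 6\0".
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

# Ask Git for the ID of the same content (no repository is needed for this).
git_id=$(printf 'hello\n' | git hash-object --stdin)

echo "manual: $manual_id"
echo "git:    $git_id"
```

Because the type is part of the hashed header, a blob and a tree can never share an ID by accident, even if their raw bytes were identical.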

Key insight: If two files have identical content, Git stores only ONE blob. Same content = same hash = same object. This is how Git stays efficient.
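A quick way to see this deduplication, sketched with two hypothetical files that differ only in name:

```bash
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Two files, different names, byte-for-byte identical content.
printf 'same content\n' > a.txt
printf 'same content\n' > copy-of-a.txt
git add a.txt copy-of-a.txt

# ":<path>" names the blob recorded in the index for that path.
id_a=$(git rev-parse :a.txt)
id_copy=$(git rev-parse :copy-of-a.txt)
echo "a.txt:         $id_a"
echo "copy-of-a.txt: $id_copy"

# Only one object has been written to the database so far.
count=$(git count-objects | cut -d' ' -f1)
echo "objects stored: $count"
```

The file names live in the tree that will be written at commit time, not in the blob, which is why both paths can share a single object.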

Real-Life Analogy

Think of a Git repository as a library's catalog system:

Blobs = The actual pages of text (content only, no titles)
Trees = The shelf index cards (which book is on which shelf)
Commits = The library log entries ("On Jan 5, Alice added 3 books to Shelf B")
SHA-1 hashes = Barcodes like an ISBN, except that they are computed from the content rather than assigned (if the content changes, the barcode changes)

Visual Architecture

flowchart TD
    C1["📸 Commit: a1b2c3d<br/>Author: Alice<br/>Message: Add homepage<br/>Parent: e4f5g6h"] --> T1["📁 Tree: f8d9e1a"]
    T1 --> B1["📄 Blob: 3c7a8b2<br/>index.html"]
    T1 --> B2["📄 Blob: 9d4e5f6<br/>style.css"]
    T1 --> T2["📁 Tree: b2c3d4e<br/>images/"]
    T2 --> B3["📄 Blob: 1a2b3c4<br/>logo.png"]
    C1 --> C0["📸 Parent Commit: e4f5g6h"]
    style C1 fill:#0f3460,stroke:#53d8fb,color:#53d8fb
    style C0 fill:#1a1a2e,stroke:#53d8fb,color:#53d8fb
    style T1 fill:#1a1a2e,stroke:#ffd700,color:#ffd700
    style T2 fill:#1a1a2e,stroke:#ffd700,color:#ffd700
    style B1 fill:#1a1a2e,stroke:#e94560,color:#e94560
    style B2 fill:#1a1a2e,stroke:#e94560,color:#e94560
    style B3 fill:#1a1a2e,stroke:#e94560,color:#e94560

Why It Matters

Integrity: SHA-1 hashes mean any corruption is detectable. You can run git fsck to verify your entire repo.
Deduplication: Identical files share one blob, saving space across branches and history.
Immutability: Objects are never modified. New commits create new objects, and old ones remain until they become unreachable and are garbage collected.
Debugging: When something goes wrong, git cat-file lets you inspect any object directly.
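The integrity point can be demonstrated by damaging an object on purpose. This is a sketch for a throwaway repository only: the file name and identity are placeholders, and it assumes the object is still stored loose (which holds for a freshly created repository).

```bash
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

echo 'important data' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# A healthy repository passes the check (exit status 0).
if git fsck >/dev/null 2>&1; then before=ok; else before=corrupt; fi

# Simulate disk corruption by overwriting the blob's file on disk.
id=$(git rev-parse HEAD:notes.txt)
path=".git/objects/$(printf '%s' "$id" | cut -c1-2)/$(printf '%s' "$id" | cut -c3-)"
chmod u+w "$path"            # loose objects are stored read-only
printf 'garbage' > "$path"

# fsck now fails because the object can no longer be read back and verified.
if git fsck >/dev/null 2>&1; then after=ok; else after=corrupt; fi

echo "before: $before, after: $after"
```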

Code

bash

# Look at the latest commit object
git cat-file -p HEAD
# Output (IDs are abbreviated placeholders here; Git prints all 40 characters):
# tree f8d9e1a...
# parent e4f5g6h...
# author Alice <alice@example.com> 1706000000 +0000
# committer Alice <alice@example.com> 1706000000 +0000
#
# Add homepage

# Look at the tree (folder listing) that commit points to
git cat-file -p f8d9e1a
# Output (entries are sorted by name, so images comes first):
# 040000 tree b2c3d4e...    images
# 100644 blob 3c7a8b2...    index.html
# 100644 blob 9d4e5f6...    style.css

# Look at a blob (raw file content)
git cat-file -p 3c7a8b2
# Output: <raw contents of index.html>

# Check the type of any object
git cat-file -t a1b2c3d
# Output: commit

# Verify repository integrity
git fsck
# Typical output: Checking object directories: 100% (256/256), done.
# Prints errors and exits non-zero if an object is corrupt or missing
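The commands above use hard-coded placeholder IDs. In a real repository you would usually let Git resolve them for you. The following sketch rebuilds the example layout in a temporary repository (file contents are stand-ins, including a fake logo.png) and walks from commit to tree to blob.

```bash
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

mkdir images
echo '<h1>Home</h1>' > index.html
echo 'body {}' > style.css
echo 'placeholder' > images/logo.png   # stand-in bytes, not a real image
git add .
git commit -q -m "Add homepage"

# Resolve names to object IDs instead of copying hashes by hand.
commit=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
blob=$(git rev-parse HEAD:index.html)

git cat-file -p "$commit"   # tree, author, committer, message
git ls-tree "$tree"         # mode, type, ID, and name of each entry
git cat-file -p "$blob"     # raw contents of index.html

# Entry names only, joined on one line for easy comparison.
names=$(git ls-tree --name-only "$tree" | tr '\n' ' ')
echo "entries: $names"
```

The root commit here has no `parent` line, since it is the first commit in the repository.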

The .git Directory

When you run git init, Git creates a hidden .git/ folder. Here's what's inside:

Path	Purpose
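`.git/objects/`	All blobs, trees, commits, and tags (stored loose or inside packfiles)
`.git/refs/`	Branch and tag pointers
`.git/HEAD`	Points to the current branch (or directly to a commit when HEAD is detached)
`.git/config`	Repository-level settings
`.git/hooks/`	Scripts that Git runs on events such as commit or push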

bash

# Peek inside the objects directory
ls .git/objects/
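# Output: info/ and pack/, plus folders named by the first 2 chars of SHA-1 hashes
# e.g., 3c/  9d/  a1/  b2/  f8/  info/  pack/
# (after git gc, most objects live in packfiles under pack/ instead)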
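To connect an object ID to its location on disk, the sketch below writes one blob into a fresh repository and derives the path of its loose object file. It assumes the object has not been packed yet; the variable names are illustrative.

```bash
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Store a blob and split its ID into directory (2 chars) and file name (rest).
id=$(printf 'hello\n' | git hash-object -w --stdin)
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
object_path=".git/objects/$dir/$file"

ls -l "$object_path"   # the zlib-compressed object on disk

# HEAD is a small text file. In a new repository it names the initial branch,
# whose exact name depends on your Git version and configuration.
head_content=$(cat .git/HEAD)
echo "HEAD contains: $head_content"
```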

Key Takeaways

Git stores data as 4 types of objects: blobs (file content), trees (directories), commits (snapshots), and tags (annotated labels).
Every object is identified by a SHA-1 hash — a fingerprint of its contents.
Identical content = identical hash = Git stores it only once (deduplication).
The .git/ directory is the brain of your repository — objects, refs, HEAD, and config all live there.

Interview Prep

Q: What are the four types of objects in Git? A: Blob (file content), Tree (directory structure), Commit (snapshot pointing to a tree with metadata), and Tag (an annotated tag object that names another object, usually a commit). Each is identified by a SHA-1 hash.
Q: How does Git ensure data integrity? A: Every object is identified by a SHA-1 hash generated from its content. If even one byte changes, the hash changes, so corruption shows up as soon as the object is re-hashed and compared with its ID. You can verify the whole repository with git fsck.
Q: If you commit the same file content in two different branches, does Git store it twice? A: No. Git is content-addressable — identical content produces the same SHA-1 hash, so Git stores only one blob object. Both branches reference the same object, saving space.