SHA Hashing

The Hook (The "Byte-Sized" Intro)

Every commit SHA you see — a1b2c3d — isn't random. It's a fingerprint computed from the object's content using SHA-1. Change one character in a file and the hash changes completely. Same content always produces the same hash. This is how Git guarantees data integrity — if any bit is corrupted, the hash won't match and Git knows something is wrong.

📖 What is SHA Hashing?

SHA-1 is a cryptographic hash function that takes any input and produces a 160-bit digest, which Git displays as a 40-character hexadecimal string. Git uses it to create unique identifiers for every object (blob, tree, commit, tag).

Conceptual Clarity

SHA properties in Git:

Property	What It Means
Deterministic	Same input → always the same hash
Unique	Different content → different hash (collisions are possible in theory, but vanishingly unlikely by accident)
Fixed length	Always 40 hex chars (160 bits)
One-way	Can't reverse a hash to get the content
Avalanche	Change 1 byte → completely different hash
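
The properties in the table can be observed directly with a plain SHA-1 tool. This is a minimal sketch that assumes GNU coreutils' `sha1sum` is available (on macOS, `shasum` is the usual substitute); the `sha1` helper is a hypothetical name introduced only for this example. Note that these are raw SHA-1 digests, not Git object IDs, because no Git header is added.

```shell
# Hypothetical helper: print only the hex digest of whatever arrives on stdin.
sha1() { sha1sum | cut -d ' ' -f 1; }

a=$(printf 'Hello World' | sha1)
b=$(printf 'Hello World' | sha1)    # same input again
c=$(printf 'Hello World!' | sha1)   # one extra byte

echo "$a"
echo "$b"
echo "$c"

# Deterministic: a and b are identical.
# Avalanche: c looks unrelated to a even though the inputs differ by one byte.
# Fixed length: both digests are 40 hex characters long.
echo "${#a} ${#c}"
```

Comparing the first two printed lines with the third shows determinism and the avalanche effect side by side.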

What gets hashed for each object type:

Object	Input to SHA-1
Blob	`blob <size>\0<content>`
Tree	`tree <size>\0<entries>`
Commit	`commit <size>\0<tree+parents+author+committer+message>`
Tag	`tag <size>\0<object+type+tag name+tagger+message>`
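
The blob row of the table can be reproduced without Git at all: prepend the header, then hash. The sketch below assumes `sha1sum` from GNU coreutils and a repository using the default SHA-1 object format; `blob_sha1` is a hypothetical helper, not a Git command.

```shell
# Hypothetical helper (not part of Git): hash a file the way Git hashes a blob.
blob_sha1() {
  size=$(wc -c < "$1" | tr -d ' ')   # content length in bytes
  # Header "blob <size>" + NUL byte, followed by the raw file content.
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

tmp=$(mktemp)
printf 'Hello World' > "$tmp"
manual=$(blob_sha1 "$tmp")
echo "$manual"

# If Git is installed, this should print the same ID:
#   git hash-object "$tmp"
```

Because the header contains the object type and size, the same bytes hashed as a raw file (without the header) give a different digest than the blob ID Git reports.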

Real-Life Analogy

SHA hashing is like a fingerprint scanner. Every person has a unique fingerprint. If you scan the same finger twice, you get the same result. If someone swaps in a different finger, the scan won't match. Git "fingerprints" every object to ensure nothing has been tampered with.

Visual Architecture

flowchart LR
    CONTENT["📄 File Content"] --> HASH["🔐 SHA-1"]
    HASH --> SHA["a1b2c3d4e5f6..."]
    CONTENT2["📄 Same Content"] --> HASH2["🔐 SHA-1"]
    HASH2 --> SHA2["a1b2c3d4e5f6..."]
    style SHA fill:#0f3460,stroke:#53d8fb,color:#53d8fb
    style SHA2 fill:#0f3460,stroke:#53d8fb,color:#53d8fb

Why It Matters

Data integrity: Corruption is detectable. Recomputing the hash (for example with git fsck) exposes any object whose content no longer matches its ID.
Deduplication: Identical files share the same hash and same storage.
Unique addressing: An object's ID depends only on its content, so the same object has the same ID in every repository.
Distributed: No central authority needed to assign IDs.
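
Deduplication and integrity checking both fall out of naming data by its hash. The following is a toy content-addressed store, not Git's actual storage format: it hashes raw bytes without a type header and does not compress anything. The names `store`, `put`, and `verify` are hypothetical, and `sha1sum` from GNU coreutils is assumed.

```shell
# Hypothetical miniature object store; all names here are illustrative.
store=$(mktemp -d)

# put: read stdin, save it under its own SHA-1, print that key.
put() {
  tmp=$(mktemp)
  cat > "$tmp"
  key=$(sha1sum < "$tmp" | cut -d ' ' -f 1)
  mv "$tmp" "$store/$key"
  printf '%s\n' "$key"
}

# verify: succeed only if the stored bytes still hash to their file name.
verify() {
  [ "$(sha1sum < "$store/$1" | cut -d ' ' -f 1)" = "$1" ]
}

k1=$(printf 'Hello World' | put)
k2=$(printf 'Hello World' | put)   # identical content, so identical key
count=$(ls "$store" | wc -l | tr -d ' ')
echo "keys equal: $k1 = $k2, files stored: $count"

# Simulate disk corruption by appending a stray byte.
printf 'X' >> "$store/$k1"
if verify "$k1"; then status=intact; else status=corrupted; fi
echo "$status"
```

Storing the same content twice leaves a single file, and the appended byte is caught because the file's name no longer matches its recomputed hash. No central ID allocator is involved at any point.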

Code

bash

# ─── Compute a blob hash manually ───
echo -n "Hello World" | git hash-object --stdin
# 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

# ─── Same content always = same hash ───
echo -n "Hello World" | git hash-object --stdin
# 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689  (identical!)

# ─── Different content = different hash ───
echo -n "Hello World!" | git hash-object --stdin
# c57eff55ebc0c54973903af5f72bac72762cf4f4  (completely different!)

# ─── View a commit's full SHA ───
git rev-parse HEAD
# a1b2c3d4e5f6789...  (40 characters)

# ─── Short SHA (typically 7 or more chars; Git lengthens it as needed to stay unique in this repository) ───
git rev-parse --short HEAD
# a1b2c3d
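
The header-plus-content recipe also applies to commits, which can be checked against a real commit ID. This sketch assumes Git and GNU `sha1sum` are installed and that new repositories use the default SHA-1 object format; the throwaway repository, the identity values, and the variable names are made up for the example.

```shell
# Throwaway repository so nothing in a real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "first commit"

# The ID Git reports for the commit.
actual=$(git -C "$repo" rev-parse HEAD)

# Rebuild the hashed bytes by hand: "commit <size>" + NUL, then the raw commit body.
size=$(git -C "$repo" cat-file -s HEAD)
manual=$( { printf 'commit %s\0' "$size"; git -C "$repo" cat-file commit HEAD; } \
          | sha1sum | cut -d ' ' -f 1 )

echo "git:    $actual"
echo "manual: $manual"

# The short form is simply a prefix of the full ID.
short=$(git -C "$repo" rev-parse --short HEAD)
echo "short:  $short"
```

The two full IDs should match. Because the commit body includes timestamps, rerunning the script produces a different ID each time, which is itself a demonstration that any change in content changes the hash.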

Key Takeaways

Every Git object gets a SHA-1 hash computed from its content.
Same content always produces the same hash — enabling deduplication.
Any change to content produces a completely different hash — enabling integrity checks.
Short SHAs (7+ chars) are used for display; full SHAs (40 chars) are stored internally.

Interview Prep

Q: Why does Git use SHA-1 hashing? A: SHA-1 provides content-addressable storage (same content = same address), data integrity verification (corruption changes the hash), and globally unique identifiers without a central authority — all essential for a distributed version control system.
Q: What makes SHA hashes "content-addressable"? A: The hash IS the address. You don't assign IDs — the content determines its own ID. This means the same file in different repositories, branches, or commits always has the same hash, enabling efficient deduplication and comparison.
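Q: Is SHA-1 still secure for Git? A: SHA-1 has known cryptographic weaknesses (practical collision attacks). Git mitigates this by using a collision-detecting SHA-1 implementation (the default since Git 2.13) and is transitioning to SHA-256, which has been available as an experimental object format since Git 2.29. For everyday content addressing, the risk of an accidental collision is extremely low.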