The Hook (The "Byte-Sized" Intro)
Every commit SHA you see — a1b2c3d — isn't random. It's a fingerprint computed from the object's content using SHA-1. Change one character in a file and the hash changes completely. Same content always produces the same hash. This is how Git guarantees data integrity — if any bit is corrupted, the hash won't match and Git knows something is wrong.
📖 What is SHA Hashing?
SHA-1 is a cryptographic hash function that maps input of any length to a 160-bit digest, which Git writes as a 40-character hexadecimal string. Git uses it to create identifiers for every object (blob, tree, commit, tag).
Conceptual Clarity
SHA properties in Git:
| Property | What It Means |
|---|---|
| Deterministic | Same input → always the same hash |
| Collision-resistant | Different content → different hash in practice (accidental collisions are astronomically unlikely) |
| Fixed length | Always 40 hex chars (160 bits) |
| One-way | Can't reverse a hash to get the content |
| Avalanche | Change 1 byte → completely different hash |
What gets hashed for each object type:
| Object | Input to SHA-1 |
|---|---|
| Blob | blob <size>\0<content> |
| Tree | tree <size>\0<entries> |
| Commit | commit <size>\0<tree+parents+author+committer+message> |
| Tag | tag <size>\0<object+type+tag name+tagger+message> |
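The blob row can be checked by hand, without asking Git at all. The sketch below assumes a POSIX shell and `sha1sum` from GNU coreutils (on macOS, `shasum -a 1` is the usual substitute); the variable names are illustrative.

```shell
#!/bin/sh
# Recompute a blob ID by hand: SHA-1 over "blob <size>\0<content>".
# Assumes sha1sum (GNU coreutils) is installed; variable names are illustrative.
content='Hello World'

# <size> is the content length in bytes, written as a decimal number.
size=$(printf '%s' "$content" | wc -c)
size=$((size))   # strip any padding that some wc implementations add

# The \0 in the format string emits the NUL byte that ends the header.
manual_id=$(printf 'blob %d\0%s' "$size" "$content" | sha1sum | cut -d' ' -f1)

echo "$manual_id"
```

If the table is accurate, this prints the same ID that `git hash-object --stdin` reports for the same 11 bytes.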
Real-Life Analogy
SHA hashing is like a fingerprint scanner. Every person has a unique fingerprint. If you scan the same finger twice, you get the same result. If someone swaps in a different finger, the scan won't match. Git "fingerprints" every object to ensure nothing has been tampered with.
Why It Matters
- Data integrity: Corruption is detectable because a damaged object no longer matches its hash (git fsck checks this across the repository).
- Deduplication: Identical files share the same hash and same storage.
- Unique addressing: The same content gets the same ID in every repository, and different content gets a different ID.
- Distributed: No central authority needed to assign IDs.
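The deduplication point is easy to see in a scratch repository. This is a minimal sketch, assuming git is installed; the temp directory and file names are made up for the demo.

```shell
#!/bin/sh
# Sketch: two files with identical bytes map to one blob ID.
# Assumes git is installed; the repository is a throwaway temp directory.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

printf 'same content\n' > a.txt
printf 'same content\n' > b.txt
git add a.txt b.txt

# Each index entry reads: <mode> <blob-id> <stage><TAB><path>
git ls-files -s

# Count the distinct blob IDs referenced by the index.
distinct_ids=$(git ls-files -s | cut -d' ' -f2 | sort -u | wc -l)
echo "distinct blob IDs: $((distinct_ids))"
```

Both paths list the same blob ID, so Git only needs to store that content once.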
Code
# ─── Compute a blob hash with git hash-object ───
# printf is used instead of echo -n, which is not portable across shells
printf 'Hello World' | git hash-object --stdin
# 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
# ─── Same content always = same hash ───
printf 'Hello World' | git hash-object --stdin
# 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689 (identical!)
# ─── Different content = different hash ───
printf 'Hello World!' | git hash-object --stdin
# c57eff55ebc0c54973903af5f72bac72762cf4f4 (completely different!)
# ─── View a commit's full SHA ───
git rev-parse HEAD
# a1b2c3d4e5f6789... (40 characters)
# ─── Short SHA (at least 7 chars by default; Git adds more if needed for uniqueness) ───
git rev-parse --short HEAD
# a1b2c3d
Key Takeaways
- Every Git object gets a SHA-1 hash computed from its content.
- Same content always produces the same hash — enabling deduplication.
- Any change to content produces a completely different hash — enabling integrity checks.
- Short SHAs (7+ chars) are used for display; full SHAs (40 chars) are stored internally.
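A short SHA is only a prefix of the full one, and Git can resolve it back as long as it stays unique. The sketch below assumes git is installed and uses a blob in a throwaway repository instead of a commit, purely so that no user.name or user.email setup is needed.

```shell
#!/bin/sh
# Sketch: abbreviate an object ID and resolve the abbreviation back.
# Assumes git is installed; uses a throwaway repository and a blob rather
# than a commit so that no identity configuration is required.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# -w stores the blob so that rev-parse can look it up.
full_id=$(printf 'Hello World' | git hash-object -w --stdin)

# --short asks for a unique prefix of the object ID.
short_id=$(git rev-parse --short "$full_id")

# --verify resolves the prefix back to the full ID.
resolved_id=$(git rev-parse --verify "$short_id")

echo "full:     $full_id"
echo "short:    $short_id"
echo "resolved: $resolved_id"
```

The same round trip works for commit IDs such as the one printed by `git rev-parse --short HEAD`.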
Interview Prep
- Q: Why does Git use SHA-1 hashing? A: SHA-1 provides content-addressable storage (same content = same address), data integrity verification (corruption changes the hash), and globally unique identifiers without a central authority. All three are essential for a distributed version control system.
- Q: What makes SHA hashes "content-addressable"? A: The hash IS the address. You don't assign IDs; the content determines its own ID. This means the same file content in different repositories, branches, or commits always has the same hash, enabling efficient deduplication and comparison.
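- Q: Is SHA-1 still secure for Git? A: SHA-1 has known cryptographic weaknesses: a practical collision attack (SHAttered) was demonstrated in 2017. Git mitigates this by using a collision-detecting SHA-1 implementation by default since version 2.13, and it has offered experimental SHA-256 repositories since version 2.29. For accidental corruption and everyday content addressing the risk is extremely low; the concern is deliberately crafted collisions.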
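To get a feel for what the SHA-256 transition changes, the sketch below hashes the same blob framing with both algorithms. It assumes `sha1sum` and `sha256sum` from GNU coreutils, and it assumes that SHA-256 repositories keep the same `<type> <size>\0` header, which is how Git's hash-function-transition design describes it.

```shell
#!/bin/sh
# Sketch: the same "blob <size>\0<content>" framing under SHA-1 and SHA-256.
# Assumes sha1sum and sha256sum (GNU coreutils) are installed.
# Assumption: SHA-256 repositories use the same header as SHA-1 ones.
sha1_id=$(printf 'blob 11\0Hello World' | sha1sum | cut -d' ' -f1)
sha256_id=$(printf 'blob 11\0Hello World' | sha256sum | cut -d' ' -f1)

echo "SHA-1:   ${#sha1_id} hex chars"
echo "SHA-256: ${#sha256_id} hex chars"
```

The visible difference is length: 40 hex characters (160 bits) versus 64 (256 bits).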