The Hook (The "Byte-Sized" Intro)
A repo doesn't get big overnight. It happens one 50MB video, one committed node_modules, one accidental database dump at a time. And here's the catch: Git never forgets. Even if you delete the file in the next commit, the blob stays in history until someone rewrites that history. Prevention is far easier than cleanup. These hygiene habits keep your repo lean from day one.
📖 What is Large Repo Hygiene?
Habits and practices that prevent repositories from growing unnecessarily large, keeping Git operations fast and clone times reasonable.
Conceptual Clarity
The hygiene checklist:
| # | Practice | Why |
|---|---|---|
| 1 | Comprehensive .gitignore | Keep build artifacts, deps, and OS files out |
| 2 | Use Git LFS for binaries | Large files stored efficiently outside Git |
| 3 | No committed dependencies | node_modules, .venv belong in .gitignore |
| 4 | No secrets in history | Once committed, secrets persist in history |
| 5 | Review large file additions | PR checks can catch oversized files |
| 6 | Prune stale branches | With fetch.prune set to true, each fetch drops remote-tracking refs whose remote branch was deleted |
| 7 | Regular git gc | Compresses objects and removes unreachable data |
| 8 | Monitor repo size | Track growth before it becomes a problem |
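Practice 8 can be turned into a simple threshold check. The sketch below is one possible approach, not a standard tool: the function names `pack_size_kib` and `check_repo_size` are hypothetical, and it relies on `git count-objects -v` reporting `size-pack` in KiB when the `-H` flag is omitted.

```shell
#!/bin/sh
# Hypothetical helper: read `git count-objects -v` output on stdin
# and print the size-pack value (KiB).
pack_size_kib() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Hypothetical check: return non-zero when the packed size exceeds
# the limit (in KiB) passed as the first argument.
check_repo_size() {
    limit_kib=$1
    size_kib=$(git count-objects -v | pack_size_kib)
    size_kib=${size_kib:-0}   # fall back to 0 outside a Git repo
    if [ "$size_kib" -gt "$limit_kib" ]; then
        echo "Repo is ${size_kib} KiB packed, above the ${limit_kib} KiB limit" >&2
        return 1
    fi
}
```

Running `check_repo_size 512000` in a scheduled CI job would flag the repo once its packs pass roughly 500 MiB. The limit itself is an arbitrary example.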
Common repo bloat sources:
| Source | Size Impact | Prevention |
|---|---|---|
| node_modules/ | 500MB+ | .gitignore |
| Video/image assets | 100MB+ per file | Git LFS |
| Database dumps | 50MB+ | .gitignore |
| Build artifacts | 50-500MB | .gitignore |
| IDE files | <5MB but noisy diffs | .gitignore |
| Accidental binaries | Varies | Pre-commit hook |
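Most rows in the table above come down to a few ignore patterns. As a minimal sketch, the hypothetical helper below writes a starter .gitignore. The patterns are illustrative assumptions and should be adjusted to the project's actual toolchain.

```shell
#!/bin/sh
# Hypothetical helper: write a starter .gitignore into the given directory
# (defaults to the current one). Overwrites any existing .gitignore there.
write_starter_gitignore() {
    target_dir=${1:-.}
    cat > "$target_dir/.gitignore" << 'EOF'
# Dependencies
node_modules/
.venv/

# Build artifacts (directory names vary by toolchain)
dist/
build/

# Database dumps
*.dump
*.sql.gz

# IDE and OS files
.idea/
.vscode/
.DS_Store
EOF
}
```

Ignore rules only affect untracked files. If node_modules is already committed, it has to be untracked first with `git rm -r --cached node_modules`. `git check-ignore -v <path>` reports which pattern, if any, matches a given path.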
Real-Life Analogy
Repo hygiene is like kitchen hygiene. Clean as you cook (ignore files, use LFS) and the kitchen stays usable. Let dishes pile up (commit binaries, skip .gitignore) and eventually nobody can work in there.
Why It Matters
- Clone speed: A 5GB repo takes minutes to clone; a 50MB repo takes seconds.
- CI cost: Large repos increase build times and storage costs.
- Permanent: Deleting a file in a later commit does not remove it from history, so prevention is key.
- Team friction: Nobody wants to wait 10 minutes to clone a repo.
Code
# ─── Monitor repo size ───
git count-objects -vH
# count: 0
# size: 0 bytes
# size-pack: 45.2 MiB ← Total compressed size
# ─── Find large files in history ───
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort -rnk2 \
| head -10
# Shows the 10 largest blobs ever committed
# ─── Pre-commit hook to block large files ───
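cat > .git/hooks/pre-commit << 'EOF'
#!/bin/sh
MAX_SIZE=5242880 # 5 MiB in bytes
# Only check added, copied, or modified files; deleted files have no blob to measure.
# Reading line by line keeps file names with spaces intact.
git diff --cached --name-only --diff-filter=ACM |
while IFS= read -r file; do
    # Measure the staged blob (":path" reads from the index), not the working-tree copy
    size=$(git cat-file -s ":$file" 2>/dev/null || echo 0)
    if [ "$size" -gt "$MAX_SIZE" ]; then
        echo "❌ $file is $((size / 1048576))MB. Use Git LFS for files > 5MB."
        exit 1
    fi
done || exit 1 # the loop runs in a subshell, so propagate its failure
EOF
chmod +x .git/hooks/pre-commit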
# ─── Clean up (if damage is already done) ───
# Use BFG Repo Cleaner to remove large files from history:
# java -jar bfg.jar --strip-blobs-bigger-than 10M repo.git
# git reflog expire --expire=now --all && git gc --prune=now --aggressive
Key Takeaways
- Prevention > cleanup. Once a large file is in history, removing it is complex.
- Use .gitignore for build artifacts and dependencies; Git LFS for binaries.
- A pre-commit hook can block files above a size threshold.
- git count-objects -vH monitors repo size; review regularly.
Interview Prep
- Q: Why is repo size management important? A: Large repos slow down cloning, CI/CD pipelines, and everyday Git operations. Since Git stores full history, large files committed once remain in the repo forever unless explicitly scrubbed from history.
- Q: How do you remove a large file that was accidentally committed? A: Use BFG Repo-Cleaner or git filter-repo to rewrite history and remove the file from all commits. Then force-push and have all team members re-clone. This is why prevention (.gitignore, LFS, pre-commit hooks) is much better.
- Q: What is Git LFS and when should you use it? A: Git Large File Storage replaces large files in the repo with small pointer files. The actual file content is stored on a separate LFS server. Use it for binaries, videos, images, and any files > 1-5MB that need to be versioned.
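To make "small pointer files" concrete, the sketch below builds the text of an LFS-style pointer by hand. It is for illustration only: `lfs_pointer` is a hypothetical function, not part of Git LFS, and it assumes `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute). In real use, `git lfs track "*.mp4"` records the pattern in .gitattributes and Git LFS generates the pointers itself.

```shell
#!/bin/sh
# Hypothetical illustration: print the pointer text that would stand in
# for a large file. The pointer holds a spec version line, the SHA-256
# of the content, and the content size in bytes.
lfs_pointer() {
    file=$1
    oid=$(sha256sum "$file" | cut -d' ' -f1)
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}
```

Whatever the size of the original file, the pointer stays at three short lines. That is why history that contains only pointers remains small.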