Large Repo Hygiene

The Hook (The "Byte-Sized" Intro)

A repo doesn't get big overnight. It happens one 50MB video, one committed node_modules, one accidental database dump at a time. And here's the catch: Git never forgets. Even if you delete the file in the next commit, the blob stays in history until someone rewrites that history. Prevention is far easier than cleanup. These hygiene habits keep your repo lean from day one.

📖 What is Large Repo Hygiene?

Habits and practices that prevent repositories from growing unnecessarily large, keeping Git operations fast and clone times reasonable.

Conceptual Clarity

The hygiene checklist:

#	Practice	Why
1	Comprehensive `.gitignore`	Keep build artifacts, deps, and OS files out
2	Use Git LFS for binaries	Large files stored efficiently outside Git
3	No committed dependencies	`node_modules`, `.venv` belong in `.gitignore`
4	No secrets in history	Once committed, secrets persist in history
5	Review large file additions	PR checks can catch oversized files
6	Prune stale branches	`git config fetch.prune true` drops stale remote-tracking refs on every fetch
7	Regular `git gc`	Compresses objects and removes unreachable data
8	Monitor repo size	Track growth before it becomes a problem
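
As a rough illustration of practices 1, 3, and 6, here's a minimal sketch run in a throwaway repo. The ignore patterns and the `example-pkg` path are assumptions for the demo, not a complete list; adjust them to your stack.

```shell
# Throwaway repo so nothing real is touched.
repo=$(mktemp -d) && cd "$repo" && git init -q

# Practices 1 and 3: keep dependencies, build output, and OS/IDE files out.
cat > .gitignore << 'EOF'
node_modules/
.venv/
dist/
build/
*.log
.DS_Store
.idea/
EOF

# Practice 6: drop stale remote-tracking refs on every fetch.
git config fetch.prune true

# Verify that a path really is ignored (prints the rule that matched).
mkdir -p node_modules/example-pkg
touch node_modules/example-pkg/index.js
git check-ignore -v node_modules/example-pkg/index.js
```

`git check-ignore -v` is handy when a file unexpectedly shows up (or doesn't) in `git status`, because it names the exact `.gitignore` line responsible.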

Common repo bloat sources:

Source	Size Impact	Prevention
`node_modules/`	500MB+	`.gitignore`
Video/image assets	100MB+ per file	Git LFS
Database dumps	50MB+	`.gitignore`
Build artifacts	50-500MB	`.gitignore`
IDE files	<5MB but noisy diffs	`.gitignore`
Accidental binaries	Varies	Pre-commit hook
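
If one of these sources has already slipped in, adding it to `.gitignore` alone isn't enough, because Git keeps tracking files that are already in the index. A minimal sketch of the fix going forward, assuming a throwaway repo, a placeholder identity, and a hypothetical `example-pkg` dependency:

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# Simulate the mistake: dependencies were committed.
mkdir -p node_modules/example-pkg
echo "module.exports = {};" > node_modules/example-pkg/index.js
git add . && git commit -q -m "Oops: commit node_modules"

# Ignore the directory, then remove it from the index only (files stay on disk).
echo "node_modules/" >> .gitignore
git rm -r -q --cached node_modules
git add .gitignore && git commit -q -m "Stop tracking node_modules"

git ls-files       # lists only .gitignore
ls node_modules    # the dependency is still on disk
```

This stops future growth, but the old blobs are still in history. Shrinking the repo itself needs a history rewrite.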

Real-Life Analogy

Repo hygiene is like kitchen hygiene. Clean as you cook (ignore files, use LFS) and the kitchen stays usable. Let dishes pile up (commit binaries, skip .gitignore) and eventually nobody can work in there.

Visual Architecture

flowchart TD
    COMMIT["Every commit"] --> CHECK{"Large file?"}
    CHECK -->|"Binary > 1MB"| LFS["📎 Git LFS"]
    CHECK -->|"Build artifact"| IGNORE["🚫 .gitignore"]
    CHECK -->|"Source code"| GIT["📦 Git"]
    style LFS fill:#1a1a2e,stroke:#ffd700,color:#ffd700
    style IGNORE fill:#2d1b1b,stroke:#e94560,color:#e94560
    style GIT fill:#1b2d1b,stroke:#53d8fb,color:#53d8fb
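
The same decision can be sketched as a small shell function. `route_file` is a hypothetical helper, and the path patterns, the 1 MiB threshold, and the "binary" test (via `grep -I`) are illustrative assumptions rather than anything Git does for you.

```shell
# Hypothetical helper mirroring the flowchart above.
route_file() {
  file=$1
  # Build artifacts and dependencies: identified by path, never committed.
  case "$file" in
    node_modules/*|dist/*|build/*) echo ".gitignore"; return 0 ;;
  esac
  size=$(wc -c < "$file" | tr -d ' ')
  # grep -I treats binary files as non-matching, so a failure here means "binary".
  if [ "$size" -gt 1048576 ] && ! grep -Iq . "$file"; then
    echo "Git LFS"
  else
    echo "Git"
  fi
}

cd "$(mktemp -d)"
echo 'print("hello")' > app.py
head -c 2097152 /dev/zero > intro.mp4   # 2 MiB of NUL bytes stands in for a video

route_file dist/bundle.js   # → .gitignore
route_file intro.mp4        # → Git LFS
route_file app.py           # → Git
```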

Why It Matters

Clone speed: On a typical connection, a 5GB repo takes minutes to clone; a 50MB repo takes seconds.
CI cost: Large repos increase build times and storage costs.
Permanent: Git keeps committed data in history unless that history is rewritten, so prevention is key.
Team friction: Nobody wants to wait 10 minutes to clone a repo.
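
To see the "permanent" point in action, here's a small demo in a throwaway repo. The file name `dump.sql`, the 2 MiB size, and the identity are placeholders chosen for the example.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# Commit a 2 MiB file, then delete it in the very next commit.
head -c 2097152 /dev/urandom > dump.sql
git add dump.sql && git commit -q -m "Add database dump"
git rm -q dump.sql && git commit -q -m "Remove database dump"

# The file is gone from the working tree...
ls dump.sql 2>/dev/null || echo "dump.sql: not in working tree"

# ...but its blob is still reachable from history and still takes up space.
git rev-list --objects --all | grep dump.sql
git count-objects -vH | grep '^size'
```

Every clone of this repo downloads that blob, even though no current file refers to it.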

Code

bash

# ─── Monitor repo size ───
git count-objects -vH
# count: 0
# size: 0 bytes
# size-pack: 45.2 MiB  ← Total compressed size

# ─── Find large files in history ───
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort -rnk2 \
  | head -10
# Shows the 10 largest blobs ever committed

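# ─── Pre-commit hook to block large files ───
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/sh
# Reject any staged file larger than MAX_SIZE.
MAX_SIZE=5242880  # 5 MiB in bytes
# --diff-filter=d skips deletions; reading line by line keeps paths with spaces intact.
git diff --cached --name-only --diff-filter=d | while IFS= read -r file; do
  # Measure the staged blob, not whatever happens to be in the working tree.
  size=$(git cat-file -s ":$file" 2>/dev/null || echo 0)
  if [ "$size" -gt "$MAX_SIZE" ]; then
    echo "❌ $file is $((size / 1048576))MB. Use Git LFS for files > 5MB."
    exit 1
  fi
done || exit 1  # the loop runs in a subshell, so pass its failure on to Git
EOF
chmod +x .git/hooks/pre-commit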

# ─── Clean up (if damage is already done) ───
# Use BFG Repo-Cleaner on a fresh mirror clone to remove large files from history:
# git clone --mirror <repo-url> repo.git
# java -jar bfg.jar --strip-blobs-bigger-than 10M repo.git
# cd repo.git && git reflog expire --expire=now --all && git gc --prune=now --aggressive

Key Takeaways

Prevention > cleanup. Once a large file is in history, removing it is complex.
Use .gitignore for build artifacts and dependencies; Git LFS for binaries.
A pre-commit hook can block files above a size threshold.
git count-objects -vH monitors repo size; review regularly.
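
One way to make "review regularly" automatic is a size budget that a CI job or scheduled script can enforce. This is only a sketch: `check_repo_size` is a hypothetical helper and the 500 MiB default budget is an arbitrary assumption.

```shell
# Hypothetical helper: fail when the object store exceeds a budget given in KiB.
check_repo_size() {
  budget_kib=${1:-512000}   # assumed default: 500 MiB
  # Without -H, count-objects reports loose (size) and packed (size-pack) usage in KiB.
  used_kib=$(git count-objects -v | awk '/^size:|^size-pack:/ {sum += $2} END {print sum}')
  if [ "$used_kib" -gt "$budget_kib" ]; then
    echo "Repo objects use ${used_kib} KiB, over the ${budget_kib} KiB budget" >&2
    return 1
  fi
  echo "Repo objects use ${used_kib} KiB (budget ${budget_kib} KiB)"
}

# Usage, from inside a repository:
#   check_repo_size 512000 || exit 1
```

Logging the reported number over time also gives you the growth trend, which is often more telling than any single reading.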

Interview Prep

Q: Why is repo size management important? A: Large repos slow down cloning, CI/CD pipelines, and everyday Git operations. Since Git stores full history, large files committed once remain in the repo forever unless explicitly scrubbed from history.
Q: How do you remove a large file that was accidentally committed? A: Use BFG Repo-Cleaner or git filter-repo to rewrite history and remove the file from all commits. Then force-push and have all team members re-clone. This is why prevention (.gitignore, LFS, pre-commit hooks) is much better.
Q: What is Git LFS and when should you use it? A: Git Large File Storage replaces large files in the repo with small pointer files. The actual file content is stored on a separate LFS server. Use it for binaries, videos, images, and any files > 1-5MB that need to be versioned.
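
For the LFS answer, it helps to know what tracking actually changes. With git-lfs installed, `git lfs track` records patterns in `.gitattributes`; the sketch below writes equivalent lines by hand so it runs without git-lfs. The `*.mp4` and `*.psd` patterns and the file names are assumptions for the demo.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q

# With git-lfs installed you would normally run:
#   git lfs install
#   git lfs track "*.mp4" "*.psd"
# The resulting .gitattributes entries look like this:
cat > .gitattributes << 'EOF'
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.psd filter=lfs diff=lfs merge=lfs -text
EOF

# Check which paths would be routed through the LFS filter.
git check-attr filter -- demo.mp4 README.md
# demo.mp4: filter: lfs
# README.md: filter: unspecified
```

Commit `.gitattributes` along with the first tracked file so every clone applies the same rules.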