Sunday, June 29, 2025

How to Remove Large Files from Git History: A Complete Guide


Have you ever encountered the dreaded "file exceeds size limit" error when trying to push to GitLab, GitHub, or other Git hosting services? You're not alone! This is a common issue that developers face, especially when accidentally committing large build artifacts, dependencies, or media files to their repositories.

The Problem: Git Push Rejected Due to Large Files

Recently, while working on a Java project, I encountered this exact error:

remote: GitLab: You are attempting to check in one or more blobs which exceed the 100.0MiB limit:
remote:
remote: - 7fd3bc8c77bf9608054e674f2e69a02a7d73191c (106 MiB)
remote:
remote: To resolve this error, you must either reduce the size of the above blobs, or utilize LFS.

The culprit was a GWT plugin ZIP file (GWT plugins/gwt-2.5.1.zip) weighing 106 MiB, above GitLab's 100 MiB limit. It should never have been committed to the repository in the first place.

Step 1: Identify the Problematic File

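When Git gives you a blob ID, you can map it back to a file path. If the file is still present in your latest commit, search the current tree:

git ls-tree -r HEAD | grep 7fd3bc8c77bf9608054e674f2e69a02a7d73191c

If that prints nothing, the blob exists only in older commits or on another branch. In that case, search every object reachable from any ref:

git rev-list --objects --all | grep 7fd3bc8c77bf9608054e674f2e69a02a7d73191c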
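Knowing the path is half the story; it also helps to know which commits added or removed the blob. Here is a minimal sketch, assuming Git 2.16 or newer (needed for --find-object). The function name commits_touching_blob is made up for this example and is not a Git command:

```shell
# Sketch: list commits on any ref that add or remove the given blob.
# commits_touching_blob is a hypothetical helper name.
commits_touching_blob() {
    blob_id=$1
    # --find-object keeps only commits where the number of occurrences
    # of this object changes, i.e. where it was added or removed.
    git log --all --find-object="$blob_id" \
        --format='%h %ad %s' --date=short --name-status
}
```

Called as commits_touching_blob 7fd3bc8c77bf9608054e674f2e69a02a7d73191c, it should print each matching commit followed by the affected path and its status (A for added, D for deleted).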

Alternative: Find All Large Files

If you want to audit your entire repository for large files:

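# Find all blobs larger than 50 MiB anywhere in history (full paths are printed, including ones with spaces)
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '$1 == "blob" && $3 > 52428800 {size = $3; sub(/^[^ ]+ [^ ]+ [^ ]+ /, ""); printf "%.1f MiB %s\n", size/1048576, $0}' | sort -n

# Or check only the files currently in the working tree (this does not look at history)
find . -path ./.git -prune -o -type f -size +50M -exec ls -lh {} +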
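The history audit above is a dense one-liner, so here is the same idea split into two small functions that are easier to reuse. This is only a sketch: the names format_large_blobs and largest_blobs and the 50 MiB default are my own assumptions, not anything Git provides.

```shell
# Sketch: report blobs above a byte threshold, largest first.
# format_large_blobs reads `git cat-file --batch-check` lines
# ("type id size path") on stdin. Both function names are hypothetical.
format_large_blobs() {
    threshold=$1
    awk -v min="$threshold" '
        $1 == "blob" && $3 > min {
            size = $3
            id = substr($2, 1, 10)
            # Drop the first three fields so paths containing spaces survive.
            sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
            printf "%.1f MiB  %s  %s\n", size / 1048576, id, $0
        }' | sort -rn
}

# Walk every object reachable from any ref and filter by size.
largest_blobs() {
    git rev-list --objects --all \
        | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
        | format_large_blobs "${1:-52428800}"
}
```

Running largest_blobs inside the repository uses the 50 MiB default; largest_blobs 10485760 would lower the threshold to 10 MiB. The sizes are uncompressed blob sizes.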

Step 2: Remove the File from Git History

Once you've identified the large file, you have several options to remove it completely from your Git history.

Method 1: git filter-repo (Modern Approach)

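First, install git-filter-repo:

pip install git-filter-repo

Then remove the file. By default git filter-repo refuses to run unless the repository is a fresh clone (pass --force to override), so the safest route is to work in a new clone:

git filter-repo --path "GWT plugins/gwt-2.5.1.zip" --invert-paths

Note that git filter-repo removes the origin remote after rewriting, so you will need to add it back with git remote add origin <url> before pushing in Step 3.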

Method 2: git filter-branch (Legacy but Reliable)

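This is the method that worked perfectly in our case. Keep in mind that Git's own documentation now discourages git filter-branch in favor of git filter-repo, mainly because it is slow and easy to misuse: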

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch "GWT plugins/gwt-2.5.1.zip"' \
--prune-empty --tag-name-filter cat -- --all

Why this works:

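  • --force: Lets filter-branch run even if backup refs from an earlier run still exist under refs/original/
  • --index-filter: Runs the command against the index of each commit without checking out files, which is much faster than --tree-filter
  • git rm --cached --ignore-unmatch: Removes the file from the index only, and exits successfully for commits that don't contain it
  • --prune-empty: Drops commits that become empty after filtering
  • --tag-name-filter cat: Rewrites tags to point at the new commits while keeping their original names
  • -- --all: Applies the rewrite to all refs, including every branch and tag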

Method 3: BFG Repo-Cleaner (Alternative)

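Download BFG and run it against a bare mirror clone of your repository (created with git clone --mirror). Note that --delete-files takes a file name rather than a path, so every file named gwt-2.5.1.zip is removed. BFG also protects the contents of your latest commit by default, so delete the file there with a normal commit first:

java -jar bfg.jar --delete-files "gwt-2.5.1.zip" your-repo.git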
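Whichever method you pick, it is worth confirming that the file is really gone from every branch and tag before you push. A minimal sketch of such a check follows; path_in_history is a hypothetical helper name of my own:

```shell
# Sketch: succeed if any commit reachable from any ref still touches the path.
# path_in_history is a hypothetical helper name.
path_in_history() {
    git log --all --format=%H -- "$1" | grep -q .
}
```

For example: path_in_history "GWT plugins/gwt-2.5.1.zip" && echo "still present" || echo "gone". One caveat: after git filter-branch, the backup refs under refs/original still point at the old commits, so this check will keep reporting "still present" until those refs are deleted in Step 3.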

Step 3: Clean Up and Push

After removing the file from history:

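  1. Clean up Git references (the first command deletes the refs/original backups that filter-branch creates and does nothing after the other methods):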

    git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    
  2. Force push to update remote:

    git push origin --force --all
    git push origin --force --tags
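To see whether the cleanup actually shrank the repository, you can compare the packed object size before the rewrite and after git gc. This is a small sketch built on git count-objects; the helper name packed_size_kib is an assumption for this example:

```shell
# Sketch: print the size of packed objects (in KiB) for the current repository.
# packed_size_kib is a hypothetical helper name.
packed_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Run it once before Step 2 and again after the git gc command above; if the number has not dropped by roughly the size of the removed file, some ref is probably still holding on to the old commits.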
    

Step 4: Prevent Future Issues

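Add problematic file types to .gitignore. Keep in mind that .gitignore only affects untracked files; anything already tracked stays tracked until you remove it with git rm --cached: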

echo "GWT plugins/" >> .gitignore
echo "*.zip" >> .gitignore
echo "*.war" >> .gitignore
echo "*.jar" >> .gitignore
echo "build/" >> .gitignore
echo "target/" >> .gitignore
git add .gitignore
git commit -m "Add gitignore for build artifacts and large files"
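A .gitignore only helps for the patterns you thought of in advance. As a second line of defense, a local pre-commit hook can reject oversized files before they ever reach a commit. The following is a rough sketch, not a drop-in tool: the function name check_staged_sizes and the 50 MiB default are my own choices, and paths containing newlines or other characters that Git quotes in its output are not handled.

```shell
#!/bin/sh
# Sketch of a pre-commit check: fail if any staged file exceeds a size limit.
# check_staged_sizes and the 50 MiB default are assumptions for this example.
check_staged_sizes() {
    limit=${1:-52428800}
    oversized=$(
        # Added or modified paths in the index, one per line.
        git diff --cached --name-only --diff-filter=AM |
        while IFS= read -r path; do
            # ":path" names the staged version of the file; skip entries
            # whose size cannot be read (for example submodules).
            size=$(git cat-file -s ":$path" 2>/dev/null) || continue
            if [ "$size" -gt "$limit" ]; then
                printf '%s (%s bytes)\n' "$path" "$size"
            fi
        done
    )
    if [ -n "$oversized" ]; then
        printf 'Staged files over %s bytes:\n%s\n' "$limit" "$oversized" >&2
        return 1
    fi
}
```

To try it, save the function to .git/hooks/pre-commit, add a final line that calls check_staged_sizes, and make the file executable with chmod +x. Hooks are local to each clone, so every team member has to install it separately.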

Common Large Files to Avoid in Git

  • Build artifacts: .war, .jar, .ear files
  • Dependencies: Node modules, Maven dependencies, Python packages
  • Media files: Large images, videos, audio files
  • Database dumps: SQL files, database backups
  • IDE files: Large project files, caches
  • Compressed archives: .zip, .tar.gz, .rar files

Alternative Solutions

Git LFS (Large File Storage)

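If you genuinely need to version large files, Git LFS stores their contents outside the regular object database and keeps only small pointer files in your commits. Make sure the patterns you track with LFS are not also listed in .gitignore: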

git lfs install
git lfs track "*.zip"
git add .gitattributes
git add your-large-file.zip
git commit -m "Add large file with LFS"
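Before committing, it can be reassuring to check that a given file will really go through LFS. git lfs track works by writing a line such as *.zip filter=lfs diff=lfs merge=lfs -text into .gitattributes, so plain Git can answer the question. A minimal sketch, with is_lfs_tracked being a hypothetical helper name:

```shell
# Sketch: succeed if .gitattributes routes the given path through the LFS filter.
# is_lfs_tracked is a hypothetical helper name.
is_lfs_tracked() {
    # git check-attr prints "<path>: filter: <value>".
    [ "$(git check-attr filter -- "$1" | sed 's/.*: filter: //')" = "lfs" ]
}
```

For example: is_lfs_tracked your-large-file.zip && echo "handled by LFS". After committing, git lfs ls-files lists the files that are actually stored as LFS pointers.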

External Storage

Consider storing large files in:

  • Cloud storage (AWS S3, Google Cloud Storage)
  • Artifact repositories (Nexus, Artifactory)
  • CDNs for media files

Important Warnings

⚠️ Before rewriting Git history:

  • Create a backup of your repository
  • Coordinate with your team - after the force push everyone needs a fresh clone, and nobody should push or merge their old branches, or the large file comes straight back
  • Understand the impact - this changes commit hashes and can break existing pull requests

⚠️ Force pushing considerations:

  • Only force push to branches you own
  • Never force push to main/master without team agreement
  • Consider using --force-with-lease, which refuses to overwrite a remote branch if someone else has pushed to it since your last fetch
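As an illustration of the last point, here is a small sketch of a wrapper that force-pushes a single branch with a lease instead of pushing everything with --force. The function name safe_force_push is an assumption for this example:

```shell
# Sketch: force-push one branch, but only if the remote branch still matches
# what we last fetched. safe_force_push is a hypothetical helper name.
safe_force_push() {
    remote=$1
    branch=$2
    # --force-with-lease=<branch> makes the push fail if the remote branch
    # has moved away from our remote-tracking ref for it.
    git push --force-with-lease="$branch" "$remote" "$branch"
}
```

Called as safe_force_push origin main, the push is rejected if a teammate pushed new commits to main after your last fetch, which gives you a chance to look at their work before overwriting it.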

Conclusion

Large files in Git repositories are a common problem, but they can be fixed with the right tools. In this case, git filter-branch completely removed the problematic GWT plugin file from the repository history; for new cleanups, git filter-repo is the tool that Git's documentation recommends.

Key takeaways:

  1. Always use .gitignore to prevent committing large files
  2. Regular repository audits can catch issues early
  3. git filter-branch is a powerful tool for cleaning Git history
  4. Consider Git LFS for legitimate large file needs
  5. Always backup before rewriting history

Remember, the best approach is prevention - set up proper .gitignore files from the start and educate your team about what should and shouldn't be committed to version control.

Have you encountered similar issues with large files in Git? What solutions worked best for your team? Share your experiences in the comments below!


This guide was based on a real-world scenario where a 106MB GWT plugin ZIP file was accidentally committed to a Java project repository. The git filter-branch solution successfully resolved the issue and allowed the code to be pushed to GitLab.
