Using Git subtrees for repository separation

Yesterday I needed to break out some shared modules into their own repositories to avoid the monolithic repository anti-pattern I’ve been so fond of in the past.

I faced a few options:

  • Git submodules: Yuck! I’ve used them before and we didn’t get on well. They’re too intrusive, requiring submodules to be initialised and updated. And switching between branches, which is where Git really shines, suddenly becomes painful because submodules don’t do what you expect. Not convinced? Read more about the issues with submodules.
  • Gitslave and Repo look interesting but I was easily dissuaded after a brief read here. They don’t seem to fit my requirements.
  • Git subtree: This promised to give me what I wanted from git submodules without the administrative downsides. The supposed disadvantage is that with subtrees you end up with a copy of the upstream module’s source code in your downstream repository. However, to me that’s a small price to pay and it may even have its advantages in that you have more control, e.g. you could make temporary project-specific tweaks to the code.

So I decided to try out the git subtree approach. I’ve only used it for a day but it seems to be working nicely.

Splitting code into its own repository

The code I wanted to have as a shared, upstream repository was already living in my downstream repository so I needed to first split it out. If you are starting a new project you don’t need to do this and can skip to the next part.

Let’s assume that in your main repository you have a directory at path/to/code (relative to the repository root) that you want to split off into its own repository called “shared”.

  • Create a new BARE local repository, e.g. ~/shared/
      mkdir ~/shared
      cd ~/shared
      git init --bare
  • Create a new remote repository for your shared code, e.g. via the GitHub or Bitbucket web interface.
  • Back in the main repository you are splitting from, split the shared code into a branch called “split”
      git subtree split --prefix=path/to/code -b split

The new branch “split” will only contain the code from that path. Note: It will have all the commit history for that path, which, unless you’ve been fastidious with your commits, will probably contain messages that pertain to your main repository. Be aware that the --squash switch will not tidy this up: git subtree documents it for add, merge and pull, not for the history that split generates.

  • Now push the new branch to your new local shared repository
      git push ~/shared/ split:master
  • From the new local shared repository, push the commit to the new remote shared repository
      cd ~/shared
      git remote add origin ssh://git@bitbucket.org/xyz/shared.git
      git push origin master

Done! You now have a shiny new repository containing your shared code, and you’re ready to share it with the world.
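If you would rather rehearse the split before touching a real project, the steps above can be sketched in a throwaway sandbox. Everything here is an assumption made for illustration: the temporary directory, the demo identity, the file names lib.txt and app.txt, and a local bare repository standing in for the remote one.

```shell
set -eu
# Hypothetical sandbox: nothing here touches a real repository.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in "main" repository with shared code under path/to/code.
git init -q "$work/main"
cd "$work/main"
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
mkdir -p path/to/code
echo "shared v1" > path/to/code/lib.txt
echo "app" > app.txt
git add .
git commit -q -m "Initial commit"

# The bare repository that will receive the split history.
git init -q --bare "$work/shared.git"

# Split the directory into its own branch, then push it across.
git subtree split --prefix=path/to/code -b split
git push -q "$work/shared.git" split:master

# List what the split branch contains: the files from path/to/code, now at the root.
git ls-tree -r --name-only split
```

The listing at the end is a quick way to check that nothing from the rest of the main repository came along for the ride.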

The next step is to make the shared repository a subtree of your main repository.

Adding the repository as a subtree of your main repository

In your main repository, you need to get rid of the original files that you split, and then add the remote repository as a subtree instead.

  • Delete the entire directory you split from, and then commit.
      git rm -r path/to/code
      git commit -m "Remove split code."
  • Add the new shared repository as a remote
      git remote add shared ssh://git@bitbucket.org/xyz/shared.git
  • Now add the remote repository as a subtree
      git subtree add --prefix=path/to/code --squash shared master

Note: we use the –squash switch because we probably just want a single snapshot commit representing version X of the shared module, rather than complicating our own commit history with spurious upstream bugfix commits. Of course if you want the entire history then feel free to leave off that switch.

You now have a subtree based on an upstream repository. Nice.

[Image: git-subtree-add]

In the image you can see the bottom commit is the squashed commit containing all the upstream code and this is merged with your code.

Important note: Do not be tempted to rebase this. Push it as is. If you rebase, git subtree won’t be able to reconcile the commits when you do your next subtree pull.
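The remove-then-add sequence can also be tried out in a sandbox first. This is only a sketch under assumptions: a local repository stands in for the Bitbucket remote, and the identity, directory names and file contents are made up for the demo.

```shell
set -eu
# Hypothetical sandbox: nothing here touches a real repository.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in upstream repository with two commits on master.
git init -q "$work/shared"
cd "$work/shared"
git symbolic-ref HEAD refs/heads/master
echo "shared v1" > lib.txt
git add lib.txt
git commit -q -m "Add lib"
echo "shared v2" > lib.txt
git commit -q -am "Update lib"

# Stand-in main repository that still carries its own copy of the code.
git init -q "$work/main"
cd "$work/main"
git symbolic-ref HEAD refs/heads/master
mkdir -p path/to/code
echo "old copy" > path/to/code/lib.txt
git add .
git commit -q -m "Initial commit"

# Remove the old copy, then bring the upstream in as a squashed subtree.
git rm -q -r path/to/code
git commit -q -m "Remove split code."
git remote add shared "$work/shared"
git subtree add --prefix=path/to/code --squash shared master

# Show the resulting history, including the merge that subtree add created.
git log --oneline --graph
```

The two upstream commits arrive as one squashed commit, which is then merged into the main history.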

So far so good. But this isn’t much use if you can’t receive changes from the upstream repository. Luckily that’s easy.

Pulling upstream changes

To pull changes from the upstream repository, just use the following command:

      git subtree pull --prefix=path/to/code --squash shared master

(You are squashing all newer upstream commits into a single one that will then be merged into your repository.) Important: as mentioned above, do not rebase these commits.
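Here is a rough sandbox sketch of a pull, assuming a local stand-in for the upstream repository and made-up file names. GIT_MERGE_AUTOEDIT=no is set only so the demo does not stop to open an editor for the merge message.

```shell
set -eu
# Hypothetical sandbox: nothing here touches a real repository.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_MERGE_AUTOEDIT=no   # keep the merge from opening an editor

# Stand-in upstream repository.
git init -q "$work/shared"
cd "$work/shared"
git symbolic-ref HEAD refs/heads/master
echo "shared v1" > lib.txt
git add lib.txt
git commit -q -m "Add lib"

# Stand-in main repository with the upstream added as a squashed subtree.
git init -q "$work/main"
cd "$work/main"
git symbolic-ref HEAD refs/heads/master
echo "app" > app.txt
git add app.txt
git commit -q -m "Initial commit"
git remote add shared "$work/shared"
git subtree add --prefix=path/to/code --squash shared master

# Upstream moves on...
cd "$work/shared"
echo "shared v2" > lib.txt
git commit -q -am "Update lib"

# ...and the main repository pulls the change in as one squashed commit.
cd "$work/main"
git subtree pull --prefix=path/to/code --squash shared master
cat path/to/code/lib.txt
```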

Pushing changes to the upstream repository

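Contributing changes to the upstream repository is as simple as:

      git subtree push --prefix=path/to/code shared master

(No --squash here: git subtree documents that switch for add, merge and pull. On a plain push there is nothing to squash, and newer versions may refuse it.)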
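A sandbox sketch of the round trip, under a few assumptions: a local bare repository stands in for the upstream (a bare repository is needed so that it accepts pushes to master), the names and contents are invented, and the push is issued without --squash on the assumption that the switch only matters for add, merge and pull.

```shell
set -eu
# Hypothetical sandbox: nothing here touches a real repository.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Bare stand-in for the upstream, seeded with one commit from a scratch repo.
git init -q --bare "$work/shared.git"
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "shared v1" > lib.txt
git add lib.txt
git commit -q -m "Add lib"
git push -q "$work/shared.git" master

# Stand-in main repository with the upstream added as a squashed subtree.
git init -q "$work/main"
cd "$work/main"
git symbolic-ref HEAD refs/heads/master
echo "app" > app.txt
git add app.txt
git commit -q -m "Initial commit"
git remote add shared "$work/shared.git"
git subtree add --prefix=path/to/code --squash shared master

# Change a shared file inside the main repository and commit as usual.
echo "shared v2 from main" > path/to/code/lib.txt
git commit -q -am "Tweak shared lib"

# Send just the changes under path/to/code back to the upstream master.
git subtree push --prefix=path/to/code shared master

# Read the file back from the upstream to confirm the change arrived.
git -C "$work/shared.git" show master:lib.txt
```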

That’s really the extent of my knowledge at this point. I’ve used this approach to reorganise module sharing between several applications and so far so good. But it’s early days still; if I come across any issues I’ll blog them.

tl;dr

Git subtree seems to be a neat way of sharing modules between Git repositories. Watch this space for gotchas.


28 thoughts on “Using Git subtrees for repository separation”

  1. This was soooo useful. I had a project with many shared modules that weren’t always used in every solution. By allowing this to be our new pattern we could keep the history of our changes to each module and still pull and push to each module as necessary.

  2. Hi there,

    I’ve been working on splitting up a large repository but I’ve come across a problem using the git subtree split.

    If the folder name has dots in it then it looks like it is working but there is no branch at the end…

  3. “Note: It will have all the commit history, which, unless you’ve been fastidious with your commits, will probably contain messages that pertain to your main repository.”

    Will the new shared repository also contain the data corresponding to the old contents of files of the old main repo? In other words, can someone checkout (from the history info) old versions of the old files from the main repo that are not included in the shared repository?

    • No, I should have worded that better. The shared repo will contain the commit message history for any commit that changed a shared file. So a `git log` on your shared repo might contain messages pertaining to the commits of your old main repo.

      • I believe it matters whether you do

        git remote add shared ssh://git@bitbucket.org/xyz/shared.git
        git subtree add --prefix=path/to/code --squash shared master

        as you described vs. doing this single command with the remote repo expressed explicitly and directly.

        git subtree add --prefix=path/to/code --squash ssh://git@bitbucket.org/xyz/shared.git master

        If you look carefully at the output from the git fetch that is done as part of the git subtree add, the former method adds a “[new branch]” that is not added by doing the latter method. (Try it each way in two test repos and use git branch -a to also see the difference.)

        This “[new branch]” is a remote-tracking branch and it holds the history of the remote that one can see mixed in, for example, when you are looking graphically at all branches of your repo.

        If you use the latter method and add the subtree without referencing a local remote name, I believe you should find that the --squash prevents the history of the subtree from cluttering up your local repository history and log.

        (My next question is what others would recommend for the proper way to clean out the remote’s added stuff when a repo has already mixed remote-tracking branch history into a repo. Some steps I thought should work do not always work completely.)

      • Thanks, that’s interesting Eric. In my case I’m not concerned about the history from the shared repo being mixed in with local – after all I’m using the shared code so the commit history is relevant. But I can imagine scenarios where your approach would be useful.

      • BTW, for those who use --squash with subtree add, pull, etc. (because they want to avoid mixing subtree repo history and parent project history), if they want to use a defined remote for convenience, they should also use “git remote add --no-tags …”. The --no-tags will exclude bringing over any tags into the remote tracking branch. That is what causes trouble for bringing in unwanted subtree history into the parent repo.
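As a small illustration of that suggestion, the sketch below registers the remote with --no-tags in a throwaway repository and prints the setting git records for it. The sandbox directory is an assumption, the URL is the one used in the post, and nothing is fetched.

```shell
set -eu
# Hypothetical sandbox: nothing here touches a real repository.
work=$(mktemp -d)
git init -q "$work/main"
cd "$work/main"

# Register the remote so that later fetches from it do not import its tags.
git remote add --no-tags shared ssh://git@bitbucket.org/xyz/shared.git

# Print the tag option git stored for the remote.
git config remote.shared.tagopt
```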

  4. Thanks for your explanation. It seems very useful to me.
    One question: are the changes pushed to local repositories of other users or do they have to add the remote repositories by themselves?
    They will pull the deletion but will they get the subtree information?

    • If you make changes to a “shared file” in your main repo and push it, other user of your main repo will obviously get those changes like any other change.
      If other users of your main repo want to be able to push changes to the shared repo they will need to do a `git remote add …` and a `git subtree add …` before they can `git subtree push …`.
      If users of another repo (say a different project) want to pull your changes from the shared repo then they will also need to `git remote add …` and a `git subtree add …` before they can `git subtree pull …`.

  5. Hi again.
    I have now done some tests and I’m always ending up in a merge conflict if I try to pull the changes from a “subtree” repository into my main repository.
    I cannot explain where the conflict comes from, there are no changes on the file in the main repository.
    Do you have any ideas where the conflicts come from and how to prevent them?

    • @Jochen, multiple posters on StackOverflow have had this problem as well. Have you been using --squash for all your subtree pushing and pulling? What version of Git are you using and on what platform?

      • p.s. To be more clear, I have not encountered the problem as yet, but I do use --squash for all subtree pushing and pulling.

    • Hi Jochen, have you been able to solve the issue? I’m running into the exact same problem: Pull > merge conflict, Push > rejected. I’m stuck.
      Thanks, Igor

  6. Merging subtree split to another branch: Is it safe?

  7. assertion failed errors when trying to git subtree split

  8. Each time I clone the super-repo, or checkout a new branch, it seems to me that it is necessary to perform the entire sequence again

    – remove sub-directory of the sub-repo; commit
    – add/fetch remote branch of the sub-repo
    – configure subtree linkage to local remote branch

    What we have are both super-repo and sub-repos having branches targeting a deployment environment (e.g. Test, Staging). So say the Test branch of the super-repo should be referencing the Test branch of the sub-repos. Correspondingly, the Staging branch uses sub-repos’ Staging branch code.

    Would have been nice if these configurations somehow are preserved in the super-repo’s git config. Then anybody cloning the repo wouldn’t have to go through these hassles again.

  9. Amazing stuff. I would love to be able to sort of cherry pick only the relevant files that I need instead of pulling in the whole repo, but still this seems to be a much better approach than submodules, at least in my case. Also, the ability to push changes to the upstream repo is great.

    Thanks for the post, I think this feature should be more documented as I can’t find that much of information around about it.

  10. I tried git subtree to copy only a folder from a repository. Unfortunately I didn’t find a way to copy a subdirectory from another repo that supports later PUSHing the changes back to the original repo. The answer
    http://stackoverflow.com/questions/23937436/add-subdirectory-of-remote-repo-with-git-subtree has a few options, but all of them seem to discuss only one-way (pull) synchronization. Can you suggest how to copy a folder from a library repository into a subfolder of my repository in a way that allows two-way (pull/push) synchronization later?

    • I think you would need to make that subdirectory into its own --bare repository. Then make its original repository (and any other repository that needs it) use git subtree to pull in (or push out to) that separated content.

      In other words, let go of the idea of making the original repository the “home” of that subdirectory. Give it its own --bare repository home and make the original repository another one of the clients that access that content using git subtree.

      Git supports extracting that subdirectory content to its own --bare repository while retaining its past history. Search for other posts about exactly how to do that.
