Learnings from using GitHub template repositories at DevRev

Published in DevRev · 8 min read · Mar 28, 2022

By Shivansh Rai and Prabath Siriwardena

At DevRev, we use a GitHub template repository as the basis for almost all projects within the organization. It contains deployment configurations and boilerplate code for building gRPC services. This template repository generates new repositories with the same directory structure, branches, and files. However, GitHub doesn’t preserve commit history in new repositories and squashes all commits from the template into one. This proves problematic when propagating changes from the template to its generated repositories, since git pull¹ no longer works.

We occasionally upgrade the template repository’s dependency versions to include new features and bug fixes. Unfortunately, these version changes had to be propagated manually to downstream repositories, which was tedious and error-prone. As a result, our engineers tended to ignore these chores, leading to a lag between the dependency versions of the template and those of its generated repositories.

Image by Roman Synkevych on Unsplash

Why git doesn’t work

git allows comparing files in different repositories and evaluating their diff. Let’s consider the task of synchronizing go.mod files between a template and its generated repository. Go’s dependency management system uses go.mod² files for managing dependency version information.

For simplicity, this blog post will use the following terms -

  • upstream as the template repository
  • downstream as the repository generated using the template repository
  • diff as the output of git diff
  • cloning as the process of generating a repository out of a template repository

The files upstream/go.mod and downstream/go.mod belong to upstream and downstream, respectively. They contain the dependency version information of the corresponding version-controlled Go projects.

$ cat upstream/go.mod
module upstream

go 1.17

require go.uber.org/zap v1.21.0 // upgraded after cloning

$ cat downstream/go.mod
module downstream

go 1.17

require (
	github.com/google/go-cmp v0.5.7 // added after cloning
	go.uber.org/zap v1.20.0
)

The minor version³ of the dependency go.uber.org/zap was upgraded from v1.20.0 to v1.21.0 in upstream after downstream was cloned, so the upgrade is missing from downstream/go.mod. We also added the dependency github.com/google/go-cmp to downstream after cloning.

Since these go.mod files belong to different repositories, we will use git diff --no-index⁴ to evaluate their diff -

$ git diff --no-index downstream/go.mod upstream/go.mod | tee diff.patch
diff --git a/downstream/go.mod b/upstream/go.mod
index 6b7d801..ca017e7 100644
--- a/downstream/go.mod
+++ b/upstream/go.mod
@@ -1,8 +1,5 @@
-module downstream
+module upstream
 
 go 1.17
 
-require (
-	github.com/google/go-cmp v0.5.7 // added after cloning
-	go.uber.org/zap v1.20.0
-)
+require go.uber.org/zap v1.21.0 // upgraded after cloning

We switch the working directory to downstream before applying⁵ the patch so that we’re in the working tree controlled by git -

$ mv diff.patch downstream
$ pushd downstream
$ git apply -p2 diff.patch
$ popd
$ cat downstream/go.mod
module upstream

go 1.17

require go.uber.org/zap v1.21.0 // upgraded after cloning

The final state of downstream/go.mod is the result of the following modifications -

  1. The module name is modified to upstream.
  2. The version of the dependency go.uber.org/zap is upgraded to v1.21.0.
  3. The dependency github.com/google/go-cmp, which was introduced downstream after cloning, is lost.

Modifications 1 and 3 are invalid.

The above strategy would have been ideal if no modifications were ever introduced downstream after cloning. In practice, however, that is not the case, and applying the diff blindly overwrites all such changes.

In this example, we intentionally did not use go get -u to manipulate dependency versions of downstream as we wanted to propagate these version changes from upstream.

The shortcomings of this approach highlight the challenges we need to deal with, described below -

  1. New entries introduced downstream after cloning (like the go-cmp requirement above) should be preserved.
  2. We cannot consider all upstream changes valid and should analyze them before applying.
  3. We need to preserve modifications introduced downstream after cloning, but not always. For example, if a user manually upgraded a downstream dependency to v0.0.2, and the same dependency later points at v0.0.3 in upstream, we should upgrade the downstream version even though the user modified it after cloning.
  4. The overall diff size should be minimal to allow easy code reviews.

AST to the rescue

Let’s now consider Bazel⁶, a build tool used extensively at DevRev. Bazel uses WORKSPACE⁷ files for managing external dependency versions, so we can translate the problem at hand to these files.

The buildtools⁸ project by Bazel provides a parser for converting a Bazel file into an Abstract syntax tree⁹ (AST), a sequence of nodes representing Bazel expressions and statements. The Wagner-Fischer algorithm¹⁰ allows us to compute the edit distance between upstream and downstream ASTs. So, instead of dealing with a string of characters, as is usually the case, we deal with an array of AST nodes. However, the AST nodes need to be comparable for the algorithm to work. As we’ll see later, this happens to be true.

After calculating the edit distance, we can backtrack through the same 2-D matrix to derive the shortest transformation sequence. The (hoverable) examples on the Wikipedia page demonstrate this process.
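
As a rough illustration, here is a minimal, generic Go sketch of this step. It fills the distance matrix and backtracks to produce an edit script over slices of comparable nodes; the editOp type, the op names, and the equal callback are hypothetical simplifications rather than the actual implementation, and substitutions are modeled as a delete plus an insert (a small deviation from textbook Wagner-Fischer) so the script only contains -d and +u operations.

// editOp is a single step of the transformation sequence: a node kept
// as-is, a downstream node to delete (-d), or an upstream node to insert (+u).
type editOp[T any] struct {
	kind string // "keep", "del" or "ins"
	node T
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// editScript fills the Wagner-Fischer matrix for the downstream (d) and
// upstream (u) node slices and backtracks through it to recover a shortest
// transformation sequence that turns d into u.
func editScript[T any](d, u []T, equal func(a, b T) bool) []editOp[T] {
	n, m := len(d), len(u)

	// dist[i][j] is the edit distance between d[:i] and u[:j].
	dist := make([][]int, n+1)
	for i := range dist {
		dist[i] = make([]int, m+1)
		dist[i][0] = i
	}
	for j := 0; j <= m; j++ {
		dist[0][j] = j
	}
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			if equal(d[i-1], u[j-1]) {
				dist[i][j] = dist[i-1][j-1]
				continue
			}
			dist[i][j] = 1 + minInt(dist[i-1][j], dist[i][j-1])
		}
	}

	// Backtrack from the bottom-right corner of the matrix.
	var ops []editOp[T]
	for i, j := n, m; i > 0 || j > 0; {
		switch {
		case i > 0 && j > 0 && equal(d[i-1], u[j-1]):
			ops = append(ops, editOp[T]{"keep", d[i-1]})
			i, j = i-1, j-1
		case i > 0 && (j == 0 || dist[i-1][j] <= dist[i][j-1]):
			ops = append(ops, editOp[T]{"del", d[i-1]})
			i--
		default:
			ops = append(ops, editOp[T]{"ins", u[j-1]})
			j--
		}
	}
	// The backtrack builds the script back to front; reverse it.
	for l, r := 0, len(ops)-1; l < r; l, r = l+1, r-1 {
		ops[l], ops[r] = ops[r], ops[l]
	}
	return ops
}

With a node-equality function along the lines of the comparison described later, the result is exactly the kind of d / -d / +u sequence discussed below.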

A single AST node carries the following contextual information about the Bazel rule, which will prove helpful (sketched after the list) -

  • name of the rule
  • fields of the rule
  • the span of the rule (starting and ending line numbers)
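
A hypothetical Go representation of this per-node context could look like the sketch below; the type and field names are illustrative and not taken from the tool's actual data model.

// ruleNode captures the context we keep for each top-level Bazel rule
// in the AST (names are illustrative; attribute values are simplified
// to strings here, although fields like urls are really lists).
type ruleNode struct {
	Kind      string            // rule name, e.g. "http_archive" or "workspace"
	Name      string            // the rule's name attribute, e.g. "com_google_protobuf"
	Attrs     map[string]string // remaining fields of the rule, e.g. sha256, urls
	StartLine int               // first line the rule spans in the file
	EndLine   int               // last line the rule spans in the file
}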

The shortest transformation sequence for converting downstream AST to upstream AST will be of the following form -
d₁, -d₂, d₃, +u₁, d₄, +u₂, d₅, -d₆

where dᵢ and uⱼ are downstream and upstream AST nodes, respectively. As noted previously, this transformation sequence needs to be processed further.

Assuming … represents a sequence of unchanged downstream nodes, we can generalize the transformation sequence under the following cases -

  1. -dᵢ, …
    implies that dᵢ is dropped from downstream
  2. +uᵢ, …
    implies that uᵢ from upstream is added to downstream
  3. -dᵢ, …, +uⱼ
    implies that uⱼ from upstream replaced dᵢ from downstream. If dᵢ and uⱼ correspond to the same Bazel rule, an equivalent rearranged transformation sequence which reduces diff size is - …, (-dᵢ, +uⱼ), …
  4. +uᵢ, …, -dⱼ
    implies that uᵢ from upstream replaced dⱼ from downstream. If uᵢ and dⱼ correspond to the same Bazel rule, an equivalent rearranged transformation sequence which reduces diff size is - …, (+uᵢ, -dⱼ), …

Cases 3 and 4 allow rearranging the transformation sequence to reduce the diff size. Instead of getting two separate entries in the diff (one for the deletion and one for the addition), we get a single entry for the modification. We only rearrange if d₂ and u₁ (similarly u₂ and d₆) correspond to the same Bazel rule. The diff size shrinks because the fields the two rules share (like name) no longer contribute to the diff, allowing easier code reviews.
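
A simplified sketch of this rearrangement, reusing the hypothetical editOp and ruleNode types from the earlier sketches: it pairs a deletion and an insertion that refer to the same Bazel rule into a single adjacent modification. The pairing strategy is an assumption for illustration, not the exact production logic.

// pairModifications rearranges an edit script so that a deleted downstream
// node and an inserted upstream node referring to the same Bazel rule
// (same kind and name attribute) appear adjacently, turning a distant
// -d / +u pair into a single modification in the final diff.
// Assumes each rule identity occurs at most once per side.
func pairModifications(ops []editOp[ruleNode]) []editOp[ruleNode] {
	key := func(n ruleNode) string { return n.Kind + "/" + n.Name }

	// Index deletions and insertions by rule identity.
	delAt, insAt := map[string]int{}, map[string]int{}
	for i, op := range ops {
		switch op.kind {
		case "del":
			delAt[key(op.node)] = i
		case "ins":
			insAt[key(op.node)] = i
		}
	}

	paired := map[int]bool{} // indices already emitted as part of a pair
	var out []editOp[ruleNode]
	for i, op := range ops {
		if paired[i] {
			continue
		}
		switch op.kind {
		case "del": // case 3: -d, …, +u  becomes  (-d, +u)
			if j, ok := insAt[key(op.node)]; ok && j > i {
				out = append(out, op, ops[j])
				paired[j] = true
				continue
			}
		case "ins": // case 4: +u, …, -d  becomes  (+u, -d)
			if j, ok := delAt[key(op.node)]; ok && j > i {
				out = append(out, op, ops[j])
				paired[j] = true
				continue
			}
		}
		out = append(out, op)
	}
	return out
}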

We can retrieve the Bazel rule name from the AST node and use it to ignore modifications to rules that aren’t supposed to be modified, such as workspace(name = "foo").

We also need to preserve the user-introduced downstream nodes. Each AST node records the line numbers it spans, which we can combine with git blame to determine whether the node was introduced downstream after cloning.
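
One way to make this check, assuming downstream's history starts at the single squashed commit GitHub created from the template: blame the lines the node spans and see whether any of them trace back to a commit other than that root commit. The commands, parsing, and example line numbers below are an illustrative assumption, not the tool's actual implementation.

package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"regexp"
	"strings"
)

// Header lines in --line-porcelain output start with a 40-character commit SHA.
var shaLine = regexp.MustCompile(`^[0-9a-f]{40} `)

// addedAfterCloning reports whether any line in [startLine, endLine] of file
// (inside repoDir) was last touched by a commit other than the repository's
// root commit, i.e. after the squashed template commit. Assumes a single root.
func addedAfterCloning(repoDir, file string, startLine, endLine int) (bool, error) {
	// The root (parentless) commit of downstream is the squashed template commit.
	root, err := exec.Command("git", "-C", repoDir,
		"rev-list", "--max-parents=0", "HEAD").Output()
	if err != nil {
		return false, err
	}
	rootSHA := strings.TrimSpace(string(root))

	// Blame only the lines the AST node spans.
	blame, err := exec.Command("git", "-C", repoDir, "blame", "--line-porcelain",
		"-L", fmt.Sprintf("%d,%d", startLine, endLine), "--", file).Output()
	if err != nil {
		return false, err
	}
	for _, line := range bytes.Split(blame, []byte("\n")) {
		if shaLine.Match(line) && !bytes.HasPrefix(line, []byte(rootSHA)) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	added, err := addedAfterCloning("./downstream", "WORKSPACE", 10, 16)
	if err != nil {
		panic(err)
	}
	fmt.Println("introduced after cloning:", added)
}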

Comparing AST nodes

We compare AST nodes first by the Bazel rule they invoke and then, for nodes invoking the same rule, by their name field. For example, the expressions below belong to the same rule, http_archive, but contain different name fields.

http_archive(
    name = "com_google_protobuf",
    sha256 = "1c11b325e9fbb655895e8fe9843479337d50dd0be56a41737cbb9aede5e9ffa0",
    strip_prefix = "protobuf-3.15.3",
    urls = ["https://github.com/protocolbuffers/protobuf/archive/v3.15.3.zip"],
)

http_archive(
    name = "golink",
    sha256 = "ea728cfc9cb6e2ae024e1d5fbff185224592bbd4dad6516f3cc96d5155b69f0d",
    strip_prefix = "golink-1.0.0",
    urls = ["https://github.com/nikunjy/golink/archive/v1.0.0.tar.gz"],
)
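
As a rough sketch of how this comparison might look with the buildtools parser (assuming its build package API of ParseWorkspace, Rules, Kind, and Name; the sameRule helper is our own illustration):

package main

import (
	"fmt"
	"os"

	"github.com/bazelbuild/buildtools/build"
)

// sameRule treats two top-level statements as the same Bazel rule when both
// the rule they invoke (e.g. http_archive) and their name attribute match;
// the remaining fields are then compared to detect modifications.
func sameRule(a, b *build.Rule) bool {
	return a.Kind() == b.Kind() && a.Name() == b.Name()
}

func main() {
	data, err := os.ReadFile("WORKSPACE")
	if err != nil {
		panic(err)
	}
	f, err := build.ParseWorkspace("WORKSPACE", data)
	if err != nil {
		panic(err)
	}
	// Print the identity (kind + name) of every rule in the file.
	for _, r := range f.Rules("") {
		fmt.Println(r.Kind(), r.Name())
	}
}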

Diff validation

While synchronizing, old dependency versions from upstream should not overwrite new versions from downstream. When we encounter a node modification in the transformation sequence, we extract the semver from the urls field and use it for comparison. The upstream change is accepted if it promotes the dependency version.

A few things to keep in mind while using this validation strategy (handled in the sketch after the list) -

  • Not all dependencies follow semantic versioning in their URLs
  • A particular rule can have multiple entries in the urls field
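
A minimal sketch of this validation in Go, covering both caveats: it scans every entry in urls for the first extractable semver and refuses to auto-accept a change when no version can be recovered. The helper names and the upstream example URL are illustrative assumptions; the version comparison relies on golang.org/x/mod/semver.

package main

import (
	"fmt"
	"regexp"

	"golang.org/x/mod/semver"
)

// semverPattern matches a vMAJOR.MINOR.PATCH fragment inside an archive URL,
// e.g. ".../archive/v3.15.3.zip" or ".../archive/refs/tags/v1.21.0.tar.gz".
var semverPattern = regexp.MustCompile(`v\d+\.\d+\.\d+`)

// versionFromURLs returns the first valid semver found in the rule's urls
// field. The boolean is false when no URL carries a semver, in which case
// the change is left for a human to review.
func versionFromURLs(urls []string) (string, bool) {
	for _, u := range urls {
		if v := semverPattern.FindString(u); v != "" && semver.IsValid(v) {
			return v, true
		}
	}
	return "", false
}

// promotesVersion reports whether the upstream rule points at a strictly
// newer version than the downstream rule.
func promotesVersion(downstreamURLs, upstreamURLs []string) bool {
	dv, ok1 := versionFromURLs(downstreamURLs)
	uv, ok2 := versionFromURLs(upstreamURLs)
	if !ok1 || !ok2 {
		return false // not semver-addressable; don't auto-accept
	}
	return semver.Compare(uv, dv) > 0
}

func main() {
	down := []string{"https://github.com/protocolbuffers/protobuf/archive/v3.15.3.zip"}
	up := []string{"https://github.com/protocolbuffers/protobuf/archive/v3.19.4.zip"}
	fmt.Println(promotesVersion(down, up)) // true
}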

Architecture

We can break the process of evaluating diffs and publishing them into the following three phases -

  • Retrieving data from upstream and downstream (crawling)
  • Processing the data (parsing)
  • Propagating the changes to downstream (publishing)

The components below implement these phases and are designed to be extensible and pluggable; for example, we could swap in turbolift¹¹ for the publishing phase.
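
The pluggability can be captured with small per-phase interfaces; the types and method signatures below are an illustrative assumption, not the tool's actual API.

// RepoPair couples a downstream repository with the template (upstream)
// it was generated from.
type RepoPair struct {
	UpstreamFile   []byte // e.g. the upstream WORKSPACE contents
	DownstreamFile []byte // the downstream counterpart
	DownstreamDir  string // local clone, needed for git blame
	DownstreamRepo string // org/name, needed for the pull request
}

// Crawler finds downstream repositories and fetches the file pairs to sync.
type Crawler interface {
	Crawl() ([]RepoPair, error)
}

// Parser evaluates the final downstream file contents using the AST diff.
type Parser interface {
	Parse(pair RepoPair) (final []byte, err error)
}

// Publisher turns the final contents into a patch and opens a pull request.
type Publisher interface {
	Publish(pair RepoPair, final []byte) error
}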

Crawler

The tool accepts a YAML¹² configuration file, described below -

author_name: devrevbot
author_email: bot@devrev.ai
commit_message: "chore: sync with upstream template repository"
org: devrev
# Personal access token with scopes: ['repo:all'], ['read:org']
token:
# Entries under allowlist are taken into consideration and PRs are
# sent for those. PRs are not sent for entries under blocklist.
# allowlist and blocklist are considered iff onlylist is empty.
# Otherwise, PRs are created only for entries under onlylist.
templaterepos:
  - name: upstream-template
    allowlist:
    blocklist:
      - downstream-repo
    onlylist:
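
A sketch of how the tool might model this file in Go; the struct field names simply mirror the YAML keys above, and the use of gopkg.in/yaml.v3 is an assumption.

package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// TemplateRepo configures one upstream template and the downstream
// repositories to include or exclude when creating pull requests.
type TemplateRepo struct {
	Name      string   `yaml:"name"`
	Allowlist []string `yaml:"allowlist"`
	Blocklist []string `yaml:"blocklist"`
	Onlylist  []string `yaml:"onlylist"`
}

// Config mirrors the YAML configuration file shown above.
type Config struct {
	AuthorName    string         `yaml:"author_name"`
	AuthorEmail   string         `yaml:"author_email"`
	CommitMessage string         `yaml:"commit_message"`
	Org           string         `yaml:"org"`
	Token         string         `yaml:"token"`
	TemplateRepos []TemplateRepo `yaml:"templaterepos"`
}

func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig("config.yaml")
	if err != nil {
		panic(err)
	}
	fmt.Printf("syncing %d template(s) in org %q\n", len(cfg.TemplateRepos), cfg.Org)
}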

The crawler reads the configuration file and uses the GitHub API to find the repositories generated from the configured templates. It then clones the downstream repositories so that git blame can be run on them. Finally, the crawler hands the downstream Bazel files and their upstream counterparts to the parser.
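
A hedged sketch of the discovery step using the go-github client; the major version in the import path, the per-repository Get calls, and the reliance on the template_repository field of the get-repository response are assumptions, not necessarily how the crawler is implemented.

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v55/github"
	"golang.org/x/oauth2"
)

// reposGeneratedFrom lists the repositories in org whose template_repository
// field points at the given template. It pages through the organization's
// repositories and fetches each one individually so the field is populated.
func reposGeneratedFrom(ctx context.Context, token, org, template string) ([]string, error) {
	tc := oauth2.NewClient(ctx, oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token}))
	client := github.NewClient(tc)

	var generated []string
	opts := &github.RepositoryListByOrgOptions{ListOptions: github.ListOptions{PerPage: 100}}
	for {
		repos, resp, err := client.Repositories.ListByOrg(ctx, org, opts)
		if err != nil {
			return nil, err
		}
		for _, r := range repos {
			// Fetch the full repository object, which carries template_repository.
			full, _, err := client.Repositories.Get(ctx, org, r.GetName())
			if err != nil {
				return nil, err
			}
			if full.GetTemplateRepository().GetName() == template {
				generated = append(generated, full.GetFullName())
			}
		}
		if resp.NextPage == 0 {
			break
		}
		opts.Page = resp.NextPage
	}
	return generated, nil
}

func main() {
	repos, err := reposGeneratedFrom(context.Background(),
		os.Getenv("GITHUB_TOKEN"), "devrev", "upstream-template")
	if err != nil {
		panic(err)
	}
	fmt.Println(repos)
}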

Parser

The parser implements the AST-based approach described above and evaluates the final state of the downstream file, which is provided to the publisher.

Publisher

The publisher prepares a patch from the file received from the parser and uses the GitHub API to create a pull request against downstream.
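
Assuming the synced file has already been committed to a branch and pushed, opening the pull request with go-github might look like the sketch below; the branch and base names are hypothetical, and this is not the tool's actual code.

package main

import (
	"context"
	"os"

	"github.com/google/go-github/v55/github"
	"golang.org/x/oauth2"
)

// openSyncPR opens a pull request in owner/repo that merges the branch
// carrying the synced Bazel files into the repository's base branch.
func openSyncPR(ctx context.Context, token, owner, repo, head, base string) error {
	tc := oauth2.NewClient(ctx, oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token}))
	client := github.NewClient(tc)

	pr := &github.NewPullRequest{
		Title: github.String("chore: sync with upstream template repository"),
		Head:  github.String(head),
		Base:  github.String(base),
		Body:  github.String("Automated sync of Bazel files from the upstream template."),
	}
	_, _, err := client.PullRequests.Create(ctx, owner, repo, pr)
	return err
}

func main() {
	err := openSyncPR(context.Background(), os.Getenv("GITHUB_TOKEN"),
		"devrev", "downstream-repo", "template-sync", "main")
	if err != nil {
		panic(err)
	}
}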

High-level architecture

Results

We ran the tool on our internal projects to synchronize WORKSPACE files between upstream and downstream. The tool introduced a few incorrect modifications by rolling back downstream dependency versions whose semver could not be reliably extracted from the urls field; these errors were easy to spot during code reviews. More importantly, the tool pulled in all the new modifications from upstream, which would otherwise have been tedious and error-prone to propagate manually.

Conclusion

This blog post described an algorithm for synchronizing WORKSPACE Bazel files between a template and its generated repositories. The algorithm is not limited to Bazel and will work for any file for which ASTs can be generated. We reduced the scope for human-introduced errors while boiling down the entire process to a single command-line invocation.
