A Serverless Solution to Keeping Git Repositories Synchronized

Synchronizing Git repositories is a fairly common requirement. This article describes how we built a solution that replicates an AWS CodeCommit repo to any other Git repo in real time, using just a Lambda function.

Why Sync?

There are many scenarios in which you might want to keep two or more Git repositories in sync, preferably in real time.

Backup

Keeping backups of code is probably the most common reason for mirroring a Git repo elsewhere. This includes use cases like automatically replicating your CodeCommit repositories across AWS regions or copying code from CodeCommit to GitHub/GitLab.

Although the Lambda-based solution described in this article can be used for backups, it’s overkill for that use case, for two main reasons:

  • Most backups of Git repositories aren’t required to be other Git repositories. They’re usually just a ZIP or tarball of the repo uploaded to long-term storage like S3.
  • Backups also need not be “real-time.” Most people are happy with a scheduled backup that runs, say, every 6 hours, depending on how heavily the source repo is used.

Integrating Disparate Systems

Often, your code resides in a system that won’t play well with other tools in your development workflow. For example, your CI/CD provider might not support integration with your Git provider. In that case, it makes sense to keep a supported Git provider in sync with your existing Git repo using an automated solution. With both code repositories in sync, you can achieve a fully-automated real-time CI/CD experience.

Our Solution

Our solution focuses specifically on taking code from a CodeCommit repo & mirroring it to any other Git repo, be it another CodeCommit repo in another AWS region or account, or outside of AWS to GitHub, GitLab, etc.

Although there already exist solutions to this challenge, like the one described in the AWS blog Replicating and Automating Sync-Ups for a Repository with AWS CodeCommit, we wanted a low-maintenance & FREE solution that could be reused/redeployed/replicated for every pair of repositories we needed to sync. Serverless & Lambda were the obvious answer!

SAM App

We started by creating an AWS SAM application that would eventually grow into the one-click solution we needed. The end result of the deployed app is shown below:

The heart of the solution is the Lambda function. Since the Git CLI is the easiest way to clone & push repositories, we wanted our Lambda function to run a shell script instead of the conventional Python or Node.js code.

Running Shell Scripts in Lambda

Although running Bash scripts in a Lambda function is easily doable as described in Run Bash Scripts in AWS Lambda Functions, running Git is a whole new ball game! Git doesn’t come preinstalled in the base Lambda runtime & installing it is rather cumbersome. It’s easier to take charge of the Lambda container itself & install everything we need in it. That’s how we arrived at using an Amazon Linux 2 container for our Lambda function.

Dockerfile

Start by creating a Dockerfile to build the Lambda container:

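# Amazon Linux 2 base image for custom (provided) Lambda runtimes
FROM public.ecr.aws/lambda/provided:al2
# Install Git for clone/push and jq for URL-encoding credentials
RUN yum update -y && yum install -y jq git && yum clean all
# bootstrap is the custom runtime entry point; it must be executable
COPY bootstrap ${LAMBDA_RUNTIME_DIR}
COPY function.sh ${LAMBDA_TASK_ROOT}
# Handler in "<script name>.<function name>" form
CMD [ "function.handler" ]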

The public.ecr.aws/lambda/provided base image is the Amazon Linux 2 runtime. The next line installs jq along with Git. The use of jq is described later in this article.

Lambda Bootstrap

bootstrap is an executable Bash script that Lambda runs as the entry point of the custom runtime. It polls the Lambda runtime API for events & passes each one to the handler:

#!/bin/bash
set -euo pipefail

# Initialization - load function handler
source "$LAMBDA_TASK_ROOT/$(echo "$_HANDLER" | cut -d. -f1).sh"

# Processing
while true
do
  HEADERS="$(mktemp)"

  # Get an event. The HTTP request will block until one is received
  EVENT_DATA=$(curl -sS -LD "$HEADERS" -X GET "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next")

  # Extract request ID by scraping response headers received above
  REQUEST_ID=$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)

  # Run the handler function from the script
  RESPONSE=$($(echo "$_HANDLER" | cut -d. -f2) "$EVENT_DATA")

  # Send the response
  curl -sS -X POST "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/$REQUEST_ID/response" -d "$RESPONSE"

  # Remove the temporary headers file so they don't pile up in /tmp
  rm -f "$HEADERS"
done
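
To make the header-scraping and handler-dispatch steps in bootstrap concrete, here is a small sketch that runs the same pipelines against made-up inputs. The function names parse_request_id, handler_file and handler_function, along with the sample request ID, are assumptions for illustration only; they are not part of the original bootstrap.

```shell
#!/bin/bash

# Extract the request ID from a file of saved HTTP response headers,
# using the same grep | tr | cut pipeline as bootstrap.
parse_request_id() {
  grep -Fi Lambda-Runtime-Aws-Request-Id "$1" | tr -d '[:space:]' | cut -d: -f2
}

# Split a handler string like "function.handler" into its two halves.
handler_file()     { echo "$(echo "$1" | cut -d. -f1).sh"; }
handler_function() { echo "$1" | cut -d. -f2; }

# Sample headers with CRLF line endings, as in an HTTP response (the ID is made up)
sample_headers="$(mktemp)"
printf 'HTTP/1.1 200 OK\r\nLambda-Runtime-Aws-Request-Id: 8476a536-e9f4-11e8-9739-2dfe598c3fcd\r\nContent-Type: application/json\r\n\r\n' > "$sample_headers"

parse_request_id "$sample_headers"    # the bare ID: header name, space & CR stripped
handler_file "function.handler"       # function.sh
handler_function "function.handler"   # handler

rm -f "$sample_headers"
```

The tr step matters because HTTP headers end in a carriage return; left in place, it would end up inside the response URL that bootstrap builds from the request ID.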

Function Handler

When the Lambda function is invoked, the handler function in function.sh is called:

#!/bin/bash
export HOME=/tmp # so Git can write .gitconfig here
CLONE_DIR=/tmp/src-repo

# URL encode
SRC_USER=$(echo -n "$SRC_USER" | jq -sRr @uri)
SRC_PASS=$(echo -n "$SRC_PASS" | jq -sRr @uri)
DEST_USER=$(echo -n "$DEST_USER" | jq -sRr @uri)
DEST_PASS=$(echo -n "$DEST_PASS" | jq -sRr @uri)

SRC_REPO=${SRC_REPO/'https://'/"https://$SRC_USER:$SRC_PASS@"}
DEST_REPO=${DEST_REPO/'https://'/"https://$DEST_USER:$DEST_PASS@"}

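function handler() {
  # Start from a clean directory, since /tmp can persist between warm invocations
  rm -rf "$CLONE_DIR"
  # Clone every ref from the source, then push them all to the destination
  git clone --mirror "$SRC_REPO" "$CLONE_DIR"
  cd "$CLONE_DIR"
  git remote add dest "$DEST_REPO"
  git push dest --mirror
  echo 'DONE! Successfully mirrored source repo to destination!'
}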

As seen above, the handler simply clones the CodeCommit repo & pushes it to the destination repo. Notice how jq is being used above the handler to URL encode the source & destination credentials. That’s because they’ll be embedded in the Git repo URLs to avoid Git prompting for credentials.
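
As an illustration of what the encoding step achieves, the sketch below embeds credentials containing reserved characters into a repo URL. To keep it runnable without jq, it uses a hypothetical pure-Bash urlencode function as a stand-in for jq’s @uri filter. The function names and sample values are assumptions, not part of the original app, and the stand-in is only meant for ASCII input.

```shell
#!/bin/bash

# Percent-encode everything except unreserved characters (A-Z a-z 0-9 - _ . ~).
# Hypothetical stand-in for `jq -sRr @uri`; intended for ASCII input only.
urlencode() {
  local input="$1" output="" char hex i
  local LC_ALL=C
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:i:1}"
    case "$char" in
      [a-zA-Z0-9.~_-]) output+="$char" ;;
      *) printf -v hex '%%%02X' "'$char"; output+="$hex" ;;
    esac
  done
  printf '%s' "$output"
}

# Insert encoded credentials right after the https:// scheme of a repo URL.
with_credentials() {
  local url="$1" user pass
  user="$(urlencode "$2")"
  pass="$(urlencode "$3")"
  printf 'https://%s:%s@%s\n' "$user" "$pass" "${url#https://}"
}

# Made-up credentials; unencoded, the "@" and "/" would corrupt the URL
with_credentials 'https://github.com/username/destination-repo.git' 'username' 'p@ss/word'
```

Without the encoding, Git would misread the password’s “@” and “/” as URL delimiters and contact the wrong host.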

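The --mirror option in git push above specifies that all refs under refs/ (including refs/heads/, refs/remotes/ & refs/tags/) are mirrored to the remote repository: new refs are pushed, updated refs are force-updated & refs deleted from the source are removed from the destination.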
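
The effect of --mirror can be tried locally, without any AWS resources. The sketch below assumes Git is installed. It mirrors a throwaway repository into an empty bare repository using the same clone-and-push sequence as the handler, then deletes a branch in the source and mirrors again. The function names and paths are hypothetical.

```shell
#!/bin/bash

# Same sequence as the handler: fresh mirror clone, then mirror push.
mirror_repo() {
  local src="$1" dest="$2" clone_dir
  clone_dir="$(mktemp -d)"
  git clone --quiet --mirror "$src" "$clone_dir/repo.git" &&
    git -C "$clone_dir/repo.git" push --quiet --mirror "$dest"
  local status=$?
  rm -rf "$clone_dir"
  return "$status"
}

# List every ref in a repository, one per line.
list_refs() {
  git -C "$1" for-each-ref --format='%(refname)'
}

work="$(mktemp -d)"
src="$work/src"
dest="$work/dest.git"

# Throwaway source repo with two branches and a tag
git init --quiet "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
  commit --quiet --allow-empty -m 'first commit'
git -C "$src" branch feature
git -C "$src" tag v1

# Empty bare repo standing in for the destination
git init --quiet --bare "$dest"

mirror_repo "$src" "$dest"
list_refs "$dest"    # both branches and the tag now exist in the destination

# Deleting a branch in the source also deletes it in the destination
git -C "$src" branch --quiet -D feature
mirror_repo "$src" "$dest"
list_refs "$dest"    # the feature branch is gone from the destination too
```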

That covers everything about the Lambda function itself. Let us now look at the SAM template that brings all the pieces together.

SAM Template

The SAM template contains just one resource, the Lambda function:

Resources:
  AwsCodeCommitSync:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: aws-codecommit-sync
      PackageType: Image
      Timeout: 900
      ReservedConcurrentExecutions: 1
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: .

As seen above, it’s best to limit the number of concurrent executions of this Lambda function to just 1. We don’t want multiple instances running in parallel & pushing to the destination at the same time when many CodeCommit events arrive in a short period.

EventBridge Rule

The Events property of the Lambda function creates the EventBridge rule that watches for CodeCommit events & triggers this Lambda:

Events:
  AllCodeCommitEvents:
    Type: EventBridgeRule
    Properties:
      Pattern:
        source:
          - aws.codecommit
        account:
          - !Ref AWS::AccountId
        region:
          - !Ref AWS::Region
        resources:
          - !Sub arn:aws:codecommit:${AWS::Region}:${AWS::AccountId}:${SourceCodeCommitRepoName}
        detail:
          repositoryName:
            - !Ref SourceCodeCommitRepoName

Template Parameters

The template expects the following parameters:

SOURCE

  • The name of the source CodeCommit repo, like source-repo.
  • The HTTPS Git clone URL of the source CodeCommit repo, like https://git-codecommit.ap-south-1.amazonaws.com/v1/repos/source-repo.
  • The Git username used to clone the source CodeCommit repo, like iam-user-at-123456789012.
  • The Git password used to clone the source CodeCommit repo.

DESTINATION

  • The HTTPS Git push URL of the destination repo, like https://github.com/username/destination-repo.git.
  • The Git username used to push to the destination repo, like your GitHub username.
  • The Git password used to push to the destination repo. If using GitHub, create a personal access token with the repo scope & use it here.

All these parameters become environment variables to the Lambda function, which uses them to clone & push to the repositories.
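
Since the function depends on all six variables being set, a fail-fast check before the clone can make misconfiguration easier to spot. This is a minimal sketch, not part of the original app: the require_env helper is hypothetical, while the variable names are the ones function.sh reads.

```shell
#!/bin/bash

# Return non-zero (after naming each offender) if any listed variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder values; in Lambda these come from the function's environment
SRC_REPO='https://git-codecommit.ap-south-1.amazonaws.com/v1/repos/source-repo'
SRC_USER='iam-user-at-123456789012'
SRC_PASS='example-password'
DEST_REPO='https://github.com/username/destination-repo.git'
DEST_USER='username'
DEST_PASS=''   # deliberately empty to trigger the check

if ! require_env SRC_REPO SRC_USER SRC_PASS DEST_REPO DEST_USER DEST_PASS; then
  echo 'Configuration incomplete; skipping sync.' >&2
fi
```

The indirect expansion ${!name:-} looks up each variable by name and treats unset and empty values alike, so the check also works when the script runs under set -u, as bootstrap does.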

Ready-to-Use App

This entire app is available on GitHub at https://github.com/harishkm7/aws-codecommit-sync. Just clone it to your system & follow the README!

Note: Do not manually push any changes to the destination repository. Because git push --mirror force-updates the destination’s refs, anything pushed there directly can be overwritten or deleted the next time the Lambda function mirrors the source. Treat the destination as a read-only repository, and push all your development changes to your source repository only.

About the Author ✍🏻

Harish KM is a Principal DevOps Engineer at QloudX & a top-ranked AWS Ambassador since 2020. 👨🏻‍💻

With over a decade of industry experience as everything from a full-stack engineer to a cloud architect, Harish has built many world-class solutions for clients around the world! 👷🏻‍♂️

With over 20 certifications in cloud (AWS, Azure, GCP), containers (Kubernetes, Docker) & DevOps (Terraform, Ansible, Jenkins), Harish is an expert in a multitude of technologies. 📚

These days, his focus is on the fascinating world of DevOps & how it can transform the way we do things! 🚀

