
Optimize the Pipeline Performance

Gems and Jewels to Collect

In this episode you will learn more about pipeline optimization techniques. The first technique presented is caching. You may also be able to speed up the pipeline by defining a specific running order for the jobs in a stage, or even by removing stages from the pipeline completely. To save resources, you can also interrupt running CI pipelines when a newer version of a particular CI pipeline starts.

Introduction

Sometimes a CI pipeline runs for a long time. The longer it takes, the later we get feedback about code style violations, defects in the code, or errors during execution of the application. As a rule of thumb, you should start thinking about optimizing the pipeline as soon as it runs longer than roughly ten minutes. In the following, a couple of techniques are explored.

Caching and GitLab CI

The first technique that might come to mind is caching. During a pipeline run, a lot of resources are downloaded. Unless new versions are available, we can reuse those fetched files in later CI jobs. Technically, this is possible because the CI runners are configured to utilize a separate caching service: if caching is enabled, artifacts created during a CI job are uploaded to this service and downloaded again by subsequent CI jobs that reuse these cached files. One example in the context of Python is caching packages managed with dependency management systems like pip, pipenv or poetry.

stages:
  - lint
  - test

default:
  image: python:3.9

variables: # defining environment variables accessible in the whole pipeline
  PY_COLORS: '1' # colour python output
  CACHE_DIR: ".cache"
  CACHE_PATH: "$CI_PROJECT_DIR/$CACHE_DIR"
  POETRY_VIRTUALENVS_PATH: "$CACHE_PATH/venv"
  POETRY_CACHE_DIR: "$CACHE_PATH/poetry"
  PIP_CACHE_DIR: "$CACHE_PATH/pip"

.dependencies:
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  cache:
    key:
      files:
        - poetry.lock
      prefix: $CI_JOB_IMAGE
    paths:
      - "$CACHE_DIR"

license_compliance:
  stage: lint
  extends: .dependencies
  script:
    - poetry run reuse lint

linting:
  stage: lint
  extends: .dependencies
  script:
    - poetry run black --check --diff src/
    - poetry run isort --check --diff src/

test:python:
  image: python:${VERSION}
  stage: test
  extends: .dependencies
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]

In this example, by using the cache keyword we declared a directory called .cache/ to be cached by GitLab CI. The key sub-key gives each cache a unique identifying name. All CI jobs that reference the same cache name also use the same cache, even if they are part of different CI pipeline runs. The Python dependencies are specified in a file called poetry.lock. If this file changes, the cache must be invalidated. Therefore, it is useful to use the file checksum of poetry.lock as the cache key, which is achieved in GitLab CI by specifying the files sub-key. The prefix sub-key additionally prepends the job's container image name ($CI_JOB_IMAGE) to the key, so that jobs running on different Python versions do not share the same cached virtual environment. Think carefully about the cache key so that caches are reused whenever possible and recreated when necessary.
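For comparison, a cache key does not have to be derived from file checksums. A simpler but coarser approach is a static key based on a predefined variable such as $CI_COMMIT_REF_SLUG, which keeps one cache per branch but is not invalidated when the dependencies change. This is a minimal sketch with a hypothetical job name and a hypothetical requirements.txt-based installation step:

deps_cache_example:
  script:
    - pip install -r requirements.txt # hypothetical dependency installation step
  cache:
    key: $CI_COMMIT_REF_SLUG # one cache per branch, never invalidated by dependency changes
    paths:
      - ".cache/pip"

A file-based key, as used above, is usually the better choice for dependency caches because it is recreated exactly when the lock file changes.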

Ultimately, you will notice that this pipeline is much faster than the same pipeline without caching. This is because all defined CI jobs reuse the .cache/ directory that contains the virtual environment and the artifacts downloaded, managed and used by pip and poetry.
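A further refinement, not used in the example above, is the policy sub-key of cache. Jobs that only consume the cache can set policy: pull to skip re-uploading the unchanged cache when the job ends, which saves additional time. This is a sketch with a hypothetical job name, assuming the .dependencies template from the example above:

lint_fast:
  extends: .dependencies
  cache:
    policy: pull # download the cache, but do not upload it again when the job ends
  script:
    - poetry run black --check --diff src/

Because extends performs a deep merge, the job inherits the cache key and paths from .dependencies and only overrides the policy.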

Defining Environment Variables

The variables keyword is a way to define environment variables globally for the whole CI pipeline or specifically for particular CI jobs as part of the CI job definition. Here is an example:

variables:
  MEANING_OF_LIFE: "42"

You can access these environment variables the same way as you access Shell environment variables, i.e. by the variable name prefixed by a dollar sign, like $MEANING_OF_LIFE:

variables:
  MEANING_OF_LIFE: "42"

my_custom_job:
  script:
    - echo "The answer to life, the universe and everything is $MEANING_OF_LIFE."
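Variables can also be defined at the job level, as mentioned above; a job-level variable overrides a global variable of the same name for that job only. A minimal sketch with hypothetical names:

variables:
  GREETING: "Hello" # global default for all jobs

my_other_job:
  variables:
    GREETING: "Bonjour" # overrides the global value for this job only
  script:
    - echo "$GREETING, world."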

In our example CI pipeline above we defined, besides PY_COLORS, five cache-related environment variables:

  • Variable CACHE_DIR is the directory name that contains all cached artifacts.
  • Variable CACHE_PATH is the full path to the cache directory using a predefined CI variable called $CI_PROJECT_DIR, which is the path to the project directory on the CI runner where all the CI actions take place.
  • Variable POETRY_VIRTUALENVS_PATH defines the path of the virtual environment created and used by Poetry.
  • Variable POETRY_CACHE_DIR defines the path to the directory Poetry uses for caching.
  • Variable PIP_CACHE_DIR defines the path to the directory Pip uses for caching.

As you can see, you can use predefined variables like $CI_PROJECT_DIR, as well as previously defined custom variables, in variables sections.

Later on in the pipeline, these environment variables come in handy: because we know where the pipeline stores its artifacts, we can tell the GitLab CI runner to cache this particular .cache directory in which everything is included.

The needs keyword

Usually, the order of the CI jobs is given by the order of the stages. All stages are executed in sequence, while all jobs within a stage are executed in parallel. This is how it looks in a diagram:

DAG with stages

This order can be changed with the needs keyword, which defines a different running order of CI jobs. A CI job might need to wait for another CI job to finish successfully because it depends on its result. Two examples: the former job creates artifacts that the other one reuses, or the former job builds the application that is tested later on.

Both examples could use the needs keyword to define the running order, but with different implications depending on whether both jobs are contained in the same stage. If they are in the same stage, the second job does not run in parallel but in sequence after the first one. If they are in different stages, the second job is also executed after the first one, but not necessarily after the whole previous stage has finished: the depending job may start earlier than the stage ordering would allow, immediately after the job it depends on finishes successfully. Note that the needs keyword takes a list of CI job names as its value; the job with the needs keyword waits for all listed jobs to finish successfully before it is executed itself.

In the following, we discuss two examples. The first one shows how to reorder CI jobs inside the same stage; the objective is to execute two jobs of one stage consecutively rather than in parallel. The first two jobs in the next example run in sequence, although they are part of the same stage. The third job starts as soon as the first stage completes successfully.

stages:
  - stage_1
  - stage_2

my_ci_job_1:
  stage: stage_1
  script:
    - echo "Execute job 1"
    - sleep 20

my_ci_job_2:
  stage: stage_1
  script:
    - echo "Execute job 2"
    - sleep 40
  needs: ["my_ci_job_1"]

my_ci_job_3:
  stage: stage_2
  script:
    - echo "Execute job 3"

For our example CI pipeline, we could run the license_compliance job before the lint job:

stages:
  - lint
  - test
  - run

default:
  image: python:3.9
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license_compliance:
  stage: lint
  script:
    - poetry run reuse lint

lint:
  stage: lint
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .
  needs: ["license_compliance"]

test:python:
  image: python:${VERSION}
  stage: test
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]

run:
  stage: run
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/

Let us have a look at the changed diagram:

DAG with stages and needs in same stage

Second, we show how to set a new order across stages. Here, the objective is to let a depending job start as soon as the job it depends on finishes successfully. This way the later job does not wait until the whole previous stage passes. In the following example, the third job needs the first job to finish successfully. This makes the pipeline much faster because the third job does not wait for the slower second job to finish.

stages:
  - stage_1
  - stage_2

my_ci_job_1:
  stage: stage_1
  script:
    - echo "Execute job 1"
    - sleep 20

my_ci_job_2:
  stage: stage_1
  script:
    - echo "Execute job 2"
    - sleep 40

my_ci_job_3:
  stage: stage_2
  script:
    - echo "Execute job 3"
  needs: ["my_ci_job_1"]

In the case of our example CI pipeline, suppose we want to run the tests as soon as the lint job finishes successfully, without waiting for the license_compliance job. In addition, the run job could be executed as soon as the test job for Python 3.9 finishes successfully, without waiting for the other test jobs.

stages:
  - lint
  - test
  - run

default:
  image: python:3.9
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license_compliance:
  stage: lint
  script:
    - poetry run reuse lint

lint:
  stage: lint
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .

test:python:
  image: python:${VERSION}
  stage: test
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]
  needs: ["lint"]

run:
  stage: run
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.9]"]

Now, the Directed Acyclic Graph (DAG) of that CI pipeline that displays the sequence and interrelations of the CI jobs looks like this:

DAG with stages and needs across stages

The overall pipeline might become much faster by introducing a new ordering that is not based on the stage ordering.

Please note the special notation of the needs keyword in the previous example when defining a dependency on a parameterized CI job: the specific parameter value is given in square brackets alongside the job name, in this example test:python: [3.9].

Stageless Pipelines

Stageless CI pipelines do not define any stages. They leave out the stages keyword completely and set the running order of the CI jobs solely with the needs keyword. Your pipeline might look similar to this example, which executes the first two jobs in parallel and the third one after the first one finishes:

my_ci_job_1:
  script:
    - echo "Execute job 1"
    - sleep 20

my_ci_job_2:
  script:
    - echo "Execute job 2"
    - sleep 40

my_ci_job_3:
  script:
    - echo "Execute job 3"
  needs: ["my_ci_job_1"]

By applying this concept to our example CI pipeline, we could make the tests depend on the lint job and the run job on the test:python: [3.9] job. This is equivalent to the following stageless pipeline:

default:
  image: python:3.9
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license_compliance:
  script:
    - poetry run reuse lint

lint:
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .

test:python:
  image: python:${VERSION}
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]
  needs: ["lint"]

run:
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.9]"]

The resulting CI pipeline might be faster than the CI pipeline with stages defined. The corresponding DAG of the CI pipeline shown above is depicted in the following diagram:

DAG stageless

Saving Resources

The last keyword to be explained in this episode is the interruptible keyword. To save infrastructure resources, it might be unreasonable to continue executing a CI pipeline if a newer version of that pipeline is about to start. By marking CI jobs as interruptible, these jobs are allowed to be canceled before they finish running. If a job is interrupted, the whole pipeline is stopped in favour of the newer one. This could be exemplified like so:

my_job_1:
  script:
    - echo "Execute job 1"
    - sleep 15
  interruptible: true

my_job_2:
  script:
    - echo "Execute job 2"
    - sleep 30
  interruptible: true

my_job_3:
  script:
    - echo "Execute job 3"
  needs: ["my_job_2"]

For our own CI pipeline, we allow all lint and test jobs to be interruptible, but not the run job, because it creates artifacts that could be incomplete or missing if the job were canceled:

default:
  image: python:3.9
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license_compliance:
  script:
    - poetry run reuse lint
  interruptible: true

lint:
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .
  interruptible: true

test:python:
  image: python:${VERSION}
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]
  needs: ["lint"]
  interruptible: true

run:
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.9]"]

Consequently, the pipeline is stopped if the run job has not been reached when a newer version of the pipeline starts. This can save a lot of resources in the long run, and it avoids blocking CI runners for a long time when CI pipelines are computationally expensive and run for a significant amount of time.

Note: When using the needs keyword, a job no longer downloads artifacts from all previous stages; it only receives artifacts from the jobs listed under needs. You can control this per dependency with the artifacts sub-key (which defaults to true), like this:

stages:
  - build
  - test

default:
  image: python:3.9

build:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build

test:
  stage: test
  script:
    - make test
  needs:
    - job: build
      artifacts: true

Exercise

Exercise 1: Optimize CI Pipeline Performance of the Exercise Project

The following exercise is about optimizing the CI pipeline for our exercise project. Remember:

  • Caching will speed up your pipeline, even though it might not be applicable in the exercise project.
  • Using stageless pipelines helps to avoid CI jobs blocking each other.
  • Making jobs interruptible will cancel a pipeline if a newer run has started, thus saving resources.

Take Home Messages

In this episode we presented some more concepts to optimize the whole CI pipeline: caching dependencies, defining a more efficient running order in a pipeline, and even defining stageless pipelines. Finally, we configured CI jobs as interruptible so that a pipeline can be stopped in favour of a newer pipeline, which saves infrastructure resources.

Next Episodes

In the last episode of this workshop we will return to the topic of removing duplication and reusing particular parts of the CI pipeline.