Remove Redundancies

Gems and Jewels to Collect

In the course of this episode you will learn a couple of techniques to remove redundancies from your GitLab CI pipeline, so that the file becomes easier to read and much easier to maintain.

Introduction

Up to now the CI pipeline does its job:

  • It makes sure the source code complies with coding and licensing guidelines,
  • it runs the tests automatically, and
  • it also runs the application.

The implementation, though, comes with a lot of duplications and redundancies.

A very popular principle in software engineering and beyond is the DRY principle: Don’t Repeat Yourself. Instead of repeating concepts you have already introduced somewhere else, you add them to your project in a way that lets you reuse them in different contexts. The most important reason is maintainability: when you touch certain aspects of your code or documentation, you do not want to search the whole code base or documentation for duplications. Because this manual step is error-prone, you will most probably miss important parts and thereby introduce inconsistencies.

A GitLab CI pipeline can quickly grow in terms of lines of code, so you should constantly watch for repetitions and refactor your pipeline as soon as you are about to introduce duplications. In this lesson, we will learn how these redundancies can be removed while keeping the functionality of the pipeline unchanged.

Set Global Defaults for Keywords

GitLab CI allows defining global default values in CI pipelines with the default keyword. Only a subset of job keywords may appear in the default section, among them image, services, before_script, after_script, cache, tags, retry, timeout, and interruptible; the script keyword, for instance, is not allowed to be defined as a default.

Let’s look at an example:

stages:
  - test

default:
  image: python:3.9
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

test:python:
  stage: test
  script:
    - poetry run pytest tests/

test:python:3.10:
  stage: test
  image: python:3.10
  script:
    - poetry run pytest tests/

As you can see, defaults like a default image and default before_script Shell commands can be placed in the default section of the CI pipeline. These defaults apply to all CI jobs as long as they are not overridden there. Compare the two CI jobs above: both use the default before_script section, but only the first job uses the default image, while the test:python:3.10 job overrides the global image keyword. Note that defaults can also be written down without using the default keyword at all:

stages:
  - test

image: python:3.9
before_script:
  - pip install --upgrade pip
  - pip install poetry
  - poetry install

test:python:
  stage: test
  script:
    - poetry run pytest tests/

test:python:3.10:
  stage: test
  image: python:3.10
  script:
    - poetry run pytest tests/

Reuse Configurations

A powerful keyword to reduce repetitions in your pipeline is the extends keyword. A similar concept in YAML that can be used for the same purpose are YAML anchors.

First, let us look at the extends keyword, which is the simpler of the two. The essence of this keyword is that you may add a block of YAML to the CI pipeline that is not a CI job itself, and is therefore not executed on its own, but a reusable block that can be referenced later on from different locations in the YAML file. If a CI job references this block with the extends keyword, the job inherits its configuration, and any keyword the job defines itself overrides the inherited value, just like a CI job overrides a global default.

Let us explore the following example:

stages:
  - stage_1

.my_extension: # block to be reused in CI jobs
  stage: stage_1
  before_script:
    - echo "Output in before_script section."
  script: 
    - echo "Output in script section."

my_ci_job_1:
  extends: .my_extension # reuse block in CI job

my_ci_job_2:
  extends: .my_extension # reuse block in CI job

The example shows that the names of reusable blocks have a leading period, e.g. .my_extension, and that they can be reused with the extends keyword inside your CI jobs, or even inside other reusable blocks, by specifying the name of the reusable block.
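Extending one reusable block from another works the same way. The following sketch uses hypothetical block names (.base and .verbose_base are not part of the example project) just to illustrate the chaining:

```yaml
stages:
  - stage_1

.base: # hypothetical base block
  stage: stage_1
  before_script:
    - echo "Preparing the job."

.verbose_base: # reusable block extending another reusable block
  extends: .base
  script:
    - echo "Running with verbose output."

my_ci_job:
  extends: .verbose_base # inherits from .verbose_base and, transitively, from .base
```

The CI job my_ci_job ends up with the stage and before_script from .base and the script from .verbose_base.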

Let us assume, for the sake of this example, that you do not want to use the default section for the before_script block but instead declare another block that you can refer to multiple times elsewhere:

stages:
  - stage_1

.my_extension: # reusable YAML block
  stage: stage_1
  before_script:
    - echo "Output 1 in before_script section."
    - echo "Output 2 in before_script section."
  script: 
    - echo "Output in script section."

my_ci_job_1:
  extends: .my_extension # reuse block in CI job

my_ci_job_2:
  extends: .my_extension # reuse block in CI job

For our example CI pipeline we could write this down as follows:

stages:
  - test

default:
  image: python:3.9

.do_testing: # block to be reused in CI jobs
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

test:python:3.9:
  extends: .do_testing # reuse block in CI job

test:python:3.10:
  image: python:3.10
  extends: .do_testing # reuse block in CI job

The greatest benefit of the extends keyword is that there is only a single location you need to change if you decide to adapt, for example, the Shell command in the script section and add some command-line options to the Pytest call. The extends keyword does not work with plain YAML lists, though: you cannot reuse a standalone list such as the commands of a before_script section. For these cases YAML has a concept called YAML anchors.
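To illustrate the single change location: adapting the Pytest call in .do_testing, here with the --verbose option as an example, changes the behavior of both test jobs at once:

```yaml
.do_testing: # the only place that needs to change
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest --verbose tests/ # picked up by every job extending this block
```

Both test:python:3.9 and test:python:3.10 now run Pytest in verbose mode without being touched themselves.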

YAML anchors are very similar to extensions but have a slightly different syntax; in fact, there are two different syntaxes depending on the context. The example above could look like this if YAML anchors were used:

stages:
  - stage_1

.my_sequence_anchor: &my-sequence-anchor-name # reusable block as YAML sequence
  - echo "Output 2 in before_script section."
  - echo "Output 3 in before_script section."

.my_hash_anchor: &my-hash-anchor-name # reusable block as nested YAML hash
  stage: stage_1
  before_script:
    - echo "Output 1 in before_script section."
    - *my-sequence-anchor-name # reuse YAML sequence in nested YAML hash
  script: 
    - echo "Output in script section."

my_ci_job_1:
  <<: *my-hash-anchor-name # reuse nested YAML hash in CI job

my_ci_job_2:
  <<: *my-hash-anchor-name # reuse nested YAML hash in CI job

Here, we defined two blocks: one is a simple YAML sequence and one is a nested YAML hash, and the former is used in the latter. In contrast to extensions, anchors work both for YAML sequences and for nested YAML hashes. Declaring such a reusable YAML block is done by assigning a name to the block prefixed by a period, followed by a colon and an anchor name with a leading ampersand, e.g. .my_sequence_anchor: &my-sequence-anchor-name or .my_hash_anchor: &my-hash-anchor-name. Referring to a block is done either by writing an asterisk followed by the anchor name in the case of a YAML sequence, for example *my-sequence-anchor-name, or by writing two less-than signs and a colon, followed by an asterisk and the anchor name, in the case of a nested YAML hash, for example <<: *my-hash-anchor-name.

The corresponding implementation for the example CI pipeline could look like this:

stages:
  - test

default:
  image: python:3.9

.before_testing: &before-testing # a reusable block as a YAML sequence
  - pip install --upgrade pip
  - pip install poetry
  - poetry install

.do_testing: &do-testing # a reusable block as a nested YAML hash
  stage: test
  before_script:
    - *before-testing # reuse YAML sequence in nested YAML hash
  script:
    - poetry run pytest tests/

test:python:3.9:
  <<: *do-testing # reuse nested YAML hash in CI job

test:python:3.10:
  image: python:3.10
  <<: *do-testing # reuse nested YAML hash in CI job

Both versions of reusable YAML, the extends keyword and YAML anchors, improve readability as well as maintainability significantly, because many duplications are stripped away.

Use matrix Jobs

In this section we will first talk about a simple example for the so-called matrix keyword and then extend the concept to a more general one. The motivation is to provide a list of variable values with n elements to a single CI job definition so that it is instantiated n times. Let us consider our previous test stage that contains three CI jobs testing with different Python interpreter versions. To reduce these redundancies arising from our three test jobs in our current CI pipeline, we would like to define it just once but instantiate it several times by providing a list of variable values. This can be done with the matrix keyword that defines parameterized CI jobs. By applying this approach to our example CI pipeline we arrive at the following pipeline:

stages:
  - test

.do_testing:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

test:python:
  image: python:${VERSION}
  extends: .do_testing
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]

As a result, the respective CI job is defined only once and the duplications are gone. With this version of the CI job definition, one CI job instance is created per list item, i.e. three in total. All of these instances run in parallel, because they all belong to the same stage, test.
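Each matrix value is exposed to its job instance as an ordinary CI/CD variable, so it can be used anywhere in the job definition, not only in the image keyword. A sketch (the echo line is just for illustration):

```yaml
test:python:
  stage: test
  image: python:${VERSION} # VERSION is expanded per matrix instance
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - echo "Running tests on Python ${VERSION}" # the matrix value is a normal variable
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.8", "3.9", "3.10"]
```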

Matrices in a More General Context

So far we used the matrix keyword with just one list of variable values, but it is even capable of working with matrices, as the name of the keyword implies. An m x n matrix is a table-like structure with m rows and n columns. It can be used to write down the different combinations of two lists of items:

       b1   b2   b3   b4
  a1   c11  c12  c13  c14
  a2   c21  c22  c23  c24
  a3   c31  c32  c33  c34

In this example we get twelve combinations from two lists consisting of three and four items, respectively.

The matrix keyword is a similar concept in GitLab CI. Two variables with m and n values can be specified in a CI job, so that m times n instances of a CI job are executed in parallel like in this example:

stages:
  - run

my_ci_job:
  stage: run
  image: python
  script:
    - python -m my_python_module --param1 ${PARAMETER_1} --param2 ${PARAMETER_2}
  parallel:
    matrix:
      - PARAMETER_1: ["1", "2", "3"]
        PARAMETER_2: ["1", "2"]

This is the table of all combinations of the elements of both lists:

PARAMETER_1   PARAMETER_2
1             1
1             2
2             1
2             2
3             1
3             2

The matrix keyword even works with multiple matrices, i.e. several entries in the matrix list, each defining its own set of variable combinations. This simplifies the YAML file quite a bit, since the job is specified only once. Please be aware of the limitation that the number of combinations generated by a parameterized CI job cannot exceed 50.
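A sketch with two matrix entries follows; the variable names and values (PROVIDER, REGION, and the region strings) are made up purely for illustration:

```yaml
stages:
  - run

my_ci_job:
  stage: run
  image: alpine
  script:
    - echo "Deploying to ${PROVIDER} in ${REGION}"
  parallel:
    matrix:
      - PROVIDER: ["aws"] # first matrix entry: 1 x 2 = 2 instances
        REGION: ["eu-west-1", "us-east-1"]
      - PROVIDER: ["gcp"] # second matrix entry: 1 x 2 = 2 instances
        REGION: ["europe-west4", "us-central1"]
```

Each entry in the matrix list is expanded independently, so this job definition yields four instances in total.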

Exercise

Exercise 1: Refactor CI Pipeline and Remove Redundancies in Exercise Project

Technical debt builds up quickly, so removing redundancy and refactoring your CI pipeline should start as early as possible and be repeated on a regular basis.

In this exercise we’ll practice refactoring our CI pipeline from the exercise project. Remember:

  • Define defaults globally with the default keyword.
  • Reuse YAML blocks in different CI jobs with the extends keyword and YAML anchors.
  • Execute parameterized CI jobs in parallel with the matrix keyword.

Please note that the dependencies keyword does not work in combination with the matrix keyword: you cannot refer to a specific instance of a parameterized CI job, e.g. the build job, in the dependencies section of the test job. That means this does not work:

build:gcc:
  [...]
  image: gcc:${VERSION}
  artifacts:
    paths:
      - "build"
  parallel:
    matrix:
      - VERSION: [ "10", "11", "12" ]
  [...]

test:gcc:
  [...]
  image: gcc:${VERSION}
  dependencies:
    - "build:gcc [${VERSION}]"
  parallel:
    matrix:
      - VERSION: [ "10", "11", "12" ]
  [...]

As a work-around you can create a build directory per compiler version in the build job and name it in the artifacts section:

build:gcc:
  [...]
  image: gcc:${VERSION}
  artifacts:
    paths:
      - build_gcc_${VERSION}
  parallel:
    matrix:
      - VERSION: [ "10", "11", "12" ]
  [...]

Then, you can access the build artifacts in the test job by referring to the build job without specifying the version parameter:

test:gcc:
  [...]
  image: gcc:${VERSION}
  dependencies:
    - "build:gcc"
  parallel:
    matrix:
      - VERSION: [ "10", "11", "12" ]
  [...]

The artifacts of the parameterized CI job will contain three build folders, one for each compiler version. Here is a generic example:

build:
  image: my_image:${VERSION}
  stage: build
  script:
    - make build_${VERSION} # building the app in directory build_[1:3]
  artifacts:
    paths:
      - build_${VERSION} # one artifact directory per parameterized CI job
  parallel:
    matrix:
      - VERSION: [ "1", "2", "3" ]

test:
  image: my_image:${VERSION}
  stage: test
  script:
    - make test_${VERSION} # testing the app in directory build_[1:3]
  dependencies:
    - "build" # refer to all artifact directories build_[1:3] from CI job build
  parallel:
    matrix:
      - VERSION: [ "1", "2", "3" ]

Take Home Messages

In this episode you learned how to simplify our CI pipeline by using defaults, the extends keyword, YAML anchors, and the matrix keyword. The default keyword reduces duplication because defaults are set once for the whole pipeline. The extends keyword and YAML anchors provide a way to define reusable blocks of YAML. The matrix keyword lets the pipeline shrink as well by parameterizing CI jobs, creating one instance per combination of the given parameters.

Next Episodes

In the upcoming episodes we focus on further performance improvements and pipeline optimizations.