How I Structure GitLab CI/CD Pipelines
Most GitLab CI tutorials show you a 20-line .gitlab-ci.yml that runs npm test. That's fine for a side project. But when you're deploying a monorepo with a frontend, backend, infrastructure-as-code, and container images across three environments — you need something more intentional.
This post walks through the patterns I use in production pipelines. Every example is drawn from a real project (simplified for clarity), not a contrived demo.
The Problem with One Big File
The default approach — one .gitlab-ci.yml with every job — falls apart fast. When you have 30+ jobs across validate, test, build, plan, deploy, and notify stages, a single file becomes unmanageable. Nobody wants to scroll through 800 lines of YAML to find the deploy job they need to tweak.
Pattern 1: Modular Includes
Split your pipeline into domain-specific files and compose them with include:
# .gitlab-ci.yml (root)
include:
- local: '.gitlab/ci/shared/shared.gitlab-ci.yml'
- local: '.gitlab/ci/frontend.gitlab-ci.yml'
- local: '.gitlab/ci/api.gitlab-ci.yml'
- local: '.gitlab/ci/container-service.gitlab-ci.yml'
- local: '.gitlab/ci/infrastructure.gitlab-ci.yml'
- local: '.gitlab/ci/security.gitlab-ci.yml'
- local: '.gitlab/ci/sandbox.gitlab-ci.yml'
- local: '.gitlab/ci/ops.gitlab-ci.yml'
Order matters: GitLab processes includes sequentially, and when two files define the same key, the later one wins. There's a sharper gotcha, though: YAML anchors only work within a single file. If api.gitlab-ci.yml tries to reference an anchor defined in shared.gitlab-ci.yml, you'll get a cryptic "unknown alias" error regardless of include order — cross-file reuse has to go through `extends` or `!reference` instead.
I organize the file structure like this:
.gitlab/
├── ci/
│ ├── shared/
│ │ ├── shared.gitlab-ci.yml # Variables, rules, anchors
│ │ ├── templates.gitlab-ci.yml # Reusable job templates
│ │ └── debug.gitlab-ci.yml # Pipeline debug/diagnostics
│ ├── frontend.gitlab-ci.yml # Frontend test/build/deploy
│ ├── api.gitlab-ci.yml # Backend test/build/deploy
│ ├── container-service.gitlab-ci.yml # Container image build/push/deploy
│ ├── infrastructure.gitlab-ci.yml # Terraform plan/apply
│ ├── security.gitlab-ci.yml # SAST, dependency audit, IaC scan, DAST
│ ├── sandbox.gitlab-ci.yml # Ephemeral environments
│ └── ops.gitlab-ci.yml # Promotion, notifications, reviewer assignment
└── README.md
Each domain file is self-contained: it defines the test, build, and deploy jobs for that service. An engineer working on the frontend only needs to look at frontend.gitlab-ci.yml. Security scans live in their own file so you can toggle advisory vs. blocking mode without touching any domain pipeline.
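Since anchors don't survive across included files, cross-file reuse in this layout goes through `extends` or `!reference`. A minimal sketch (the hidden job and its contents are illustrative, not from the real pipeline):

```yaml
# .gitlab/ci/shared/shared.gitlab-ci.yml
.aws-setup:
  before_script:
    - aws sts get-caller-identity   # sanity-check credentials before anything else

# .gitlab/ci/api.gitlab-ci.yml
deploy:api:
  extends: [.aws-setup]             # merge the whole hidden job from another file
  script:
    - ./deploy.sh
  # or cherry-pick a single key from another file:
  # before_script: !reference [.aws-setup, before_script]
```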
Pattern 2: Environment Branching with YAML Anchors
I use an environment branching strategy: dev (default) → stage → prod. Each branch maps to an AWS account and URL. The trick is making every job automatically resolve the right environment without hardcoding anything.
First, define variables per environment using YAML anchors:
# shared.gitlab-ci.yml
.vars-dev: &vars-dev
AWS_ACCOUNT: $AWS_ACCOUNT_DEV
ENVIRONMENT: dev
ENVIRONMENT_URL: https://dev.example.com
.vars-stage: &vars-stage
AWS_ACCOUNT: $AWS_ACCOUNT_STAGE
ENVIRONMENT: stage
ENVIRONMENT_URL: https://stage.example.com
.vars-prod: &vars-prod
AWS_ACCOUNT: $AWS_ACCOUNT_PROD
ENVIRONMENT: prod
ENVIRONMENT_URL: https://example.com
Then define branch/MR detection rules:
.if-dev-commit: &if-dev-commit
if: '$CI_COMMIT_REF_NAME == "dev" && $CI_PIPELINE_SOURCE == "push"'
.if-dev-mr: &if-dev-mr
if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "dev"'
# Same pattern for stage and prod...
Now compose them into atomic rule entries that bundle the condition with its variables:
.rule-dev-commit: &rule-dev-commit
<<: *if-dev-commit
interruptible: false # Never cancel a deployment in progress
variables:
<<: [*vars-dev]
.rule-dev-mr: &rule-dev-mr
<<: *if-dev-mr
interruptible: true # Safe to cancel MR pipelines
variables:
<<: [*vars-dev]
And finally, full rule sets that jobs can reference:
.rules:all:mr:commit:
rules:
- <<: *rule-dev-mr
- <<: *rule-dev-commit
- <<: *rule-stage-mr
- <<: *rule-stage-commit
- <<: *rule-prod-mr
- <<: *rule-prod-commit
This means any job can simply extends: [ .rules:all:mr:commit ] and it automatically gets the correct ENVIRONMENT, AWS_ACCOUNT, and ENVIRONMENT_URL — no if/else logic needed in the job itself.
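A deploy job that opts in might look like this (the job itself is illustrative, not from the real pipeline):

```yaml
deploy:frontend:
  extends: [.rules:all:mr:commit]
  stage: deploy
  environment:
    name: $ENVIRONMENT          # filled in by whichever rule matched
    url: $ENVIRONMENT_URL
  script:
    - ./scripts/deploy-frontend.sh "$AWS_ACCOUNT" "$ENVIRONMENT"
```

The job never mentions dev, stage, or prod by name; the matched rule supplies everything.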
Pattern 3: Reusable Job Templates
Define base jobs that encapsulate common setup, then extend them:
# templates.gitlab-ci.yml
.cache:npm:
cache:
- key:
files: [package-lock.json]
prefix: npm-cache
paths: [.npm/]
policy: pull-push
.node:base:
extends: [.cache:npm]
image: node:20
before_script:
- npm ci --cache .npm --prefer-offline
.test:base:
extends: [.node:base, .rules:all:mr:commit]
stage: test
needs: []
script:
- npm run ${TEST_COMMAND}
coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'
Now domain-specific test jobs become minimal:
# api.gitlab-ci.yml
test:api:
extends: [.test:base]
variables:
APP_PATH: $API_DIR
TEST_COMMAND: "test:api"
# frontend.gitlab-ci.yml
test:frontend:
extends: [.test:base]
variables:
APP_PATH: $FRONTEND_DIR
TEST_COMMAND: "test:frontend"
Each test job is 5 lines. All the npm caching, coverage parsing, and environment rules are inherited. When you need to change how tests run globally, you edit one template.
Pattern 4: Change Detection for MR Pipelines
In a monorepo, you don't want frontend tests re-running when someone changes a Terraform file. Change detection solves this — but only for MR pipelines. Commit pipelines to deployment branches always run everything (you want full confidence before deploying).
Define paths per domain:
.paths-frontend: &paths-frontend
- .gitlab-ci.yml
- .gitlab/ci/shared/**/*
- .gitlab/ci/frontend.gitlab-ci.yml
- package.json
- package-lock.json
- apps/frontend/**/*
.paths-api: &paths-api
- .gitlab-ci.yml
- .gitlab/ci/shared/**/*
- .gitlab/ci/api.gitlab-ci.yml
- apps/api/**/*
Then create rules that layer change detection on top of the base rules:
.rules:frontend:mr:commit:
rules:
# For dev MRs: only run if frontend files changed
- <<: *rule-dev-mr
changes:
paths: *paths-frontend
# Skip dev MR if no changes matched
- <<: *if-dev-mr
when: never
# All other pipelines: run normally
- !reference [.rules:all:mr:commit, rules]
The key insight: the first rule says "run on dev MRs if these files changed." The second rule says "otherwise, skip on dev MRs." All other rules (stage/prod MRs, commit pipelines) fall through unchanged. This means change detection is surgical — it only applies to dev MR pipelines where fast feedback matters most.
Pattern 5: Stages That Tell a Story
Don't just use test, build, deploy. Your stages should describe your deployment flow:
stages:
- .pre # Debug variables, ECR login, auth tokens
- validate # Lint, terraform fmt/validate
- security # SAST, dependency audit, IaC scan, container scan
- test # Unit & integration tests (parallel)
- build # Docker images, frontend bundles, Lambda zips
- infra-plan # Terraform plan (preview)
- infra-apply # Terraform apply (provision)
- deploy # Push images, deploy apps
- verify # Health checks, DAST scans
- notify # Teams/Slack notifications
- .post # Cleanup, promotion MRs
Splitting infra-plan and infra-apply into separate stages is intentional. The plan runs on every pipeline (including MRs) so reviewers can see what infrastructure changes a code change will trigger. The apply only runs on commit pipelines to deployment branches.
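A minimal sketch of the split (rule-set names follow Pattern 2; the jobs themselves are illustrative):

```yaml
infra:plan:
  stage: infra-plan
  extends: [.rules:all:mr:commit]   # MR and commit pipelines both get a plan
  script:
    - terraform init -input=false
    - terraform plan -out=tfplan
  artifacts:
    paths: [tfplan]

infra:apply:
  stage: infra-apply
  extends: [.rules:all:commit]      # commit pipelines to deployment branches only
  needs: [infra:plan]               # consumes the saved plan artifact
  script:
    - terraform init -input=false
    - terraform apply -input=false tfplan   # applying a saved plan never prompts
```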
The security stage runs in parallel with tests — static analysis doesn't depend on a build, so there's no reason to wait. Dynamic analysis (DAST) runs later in verify because it needs a live deployed target to scan against.
Pattern 6: Security Scanning Pipeline
Security scanning shouldn't be an afterthought bolted onto CI. I run five layers of scanning, each covering a different attack surface:
| Scan | Tool | What it catches | When it runs |
|---|---|---|---|
| SAST | Semgrep | Code-level vulnerabilities (OWASP Top 10, secrets) | Dev MRs + dev commits |
| Dependencies | npm audit | Known CVEs in packages | Dev MRs + dev commits |
| IaC | Trivy config | Terraform misconfigurations | When infra files change |
| Containers | Trivy image | OS/library CVEs in Docker images | When container files change |
| DAST | OWASP ZAP | Runtime vulnerabilities in live API | Post-deploy to stage |
The first four run in the security stage (pre-deploy). DAST runs in verify (post-deploy) because it needs a live target.
Advisory mode: visible but non-blocking
Every security job uses allow_failure: true. The pipeline stays green, but a failed security job shows a red X — visible in the MR and pipeline views. This gives you signal without blocking deployments while you triage the initial baseline.
sast:semgrep:
stage: security
image:
name: semgrep/semgrep:latest
entrypoint: [""]
needs: []
rules:
- !reference [.rules:dev:mr:commit, rules]
script:
- mkdir -p security-results/semgrep
- >
semgrep scan
--config p/owasp-top-ten
--config p/javascript
--config p/typescript
--config p/secrets
--gitlab-sast
--gitlab-sast-output security-results/semgrep/gl-sast-report.json
apps/
artifacts:
when: always
paths:
- security-results/semgrep/gl-sast-report.json
reports:
sast: security-results/semgrep/gl-sast-report.json
allow_failure: true
Once you've triaged the baseline, flip allow_failure: false per scan type to make it blocking. You can do this incrementally — start with SAST (fewest false positives), then dependencies, then IaC.
Dual output: machines and humans
Each scan produces two artifacts: a machine-readable JSON report for GitLab's Security Dashboard (artifacts.reports.sast, artifacts.reports.container_scanning, artifacts.reports.dast) and a human-readable text file you can browse directly from the pipeline artifact viewer. The dashboard aggregates findings across MRs; the text output lets you triage without leaving the pipeline.
Container scanning with matrix jobs
When you have multiple container images, use parallel: matrix to scan each one as a separate job:
scan:container:
stage: security
image:
name: aquasec/trivy:latest
entrypoint: [""]
needs:
- job: build:container-service
artifacts: true
parallel:
matrix:
- CONTAINER_NAME: converter-service
CONTAINER_TAR: "${CI_PROJECT_DIR}/converter-image.tar"
# Add more images here as your project grows
script:
- mkdir -p "security-results/trivy-container/${CONTAINER_NAME}"
- >
trivy image
--input "${CONTAINER_TAR}"
--severity "HIGH,CRITICAL"
--format template
--template "@/contrib/gitlab.tpl"
--output "security-results/trivy-container/${CONTAINER_NAME}/gl-container-scanning-report.json"
--exit-code 1
artifacts:
when: always
reports:
container_scanning: "security-results/trivy-container/${CONTAINER_NAME}/gl-container-scanning-report.json"
allow_failure: true
Adding a new image is one matrix entry. The Trivy --input flag scans a tarball from the build stage rather than pulling from a registry — the image doesn't need to be pushed yet.
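Concretely, a second image is one more matrix entry (the second service name here is hypothetical):

```yaml
parallel:
  matrix:
    - CONTAINER_NAME: converter-service
      CONTAINER_TAR: "${CI_PROJECT_DIR}/converter-image.tar"
    - CONTAINER_NAME: resizer-service
      CONTAINER_TAR: "${CI_PROJECT_DIR}/resizer-image.tar"
```

Each entry becomes its own job in the pipeline view, so a vulnerable image is immediately attributable.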
Scope limiting: scan once, promote with confidence
Security scans only run on dev MRs and dev commits. Stage and prod are promotion pipelines — the code is identical to what already passed scanning on dev. Re-running SAST on a promotion MR is wasted compute.
The exception is IaC scanning, which runs on all environments because Terraform configs can differ per environment (different instance sizes, different feature flags in tfvars).
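The `.rules:dev:mr:commit` set that the scan jobs reference is just the dev slice of the rule composition from Pattern 2 — a sketch consistent with those anchors (it lives alongside them in shared.gitlab-ci.yml, since anchors don't cross files):

```yaml
.rules:dev:mr:commit:
  rules:
    - <<: *rule-dev-mr        # dev MR pipelines
    - <<: *rule-dev-commit    # pushes to the dev branch
```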
DAST: testing the live API
OWASP ZAP runs post-deploy against the stage environment. It spiders the API for 5 minutes, then runs passive and active rules:
dast:zap-baseline:
stage: verify
image:
name: ghcr.io/zaproxy/zaproxy:stable
entrypoint: [""]
needs:
- job: deploy:api
artifacts: false
rules:
- !reference [.rules:stage:commit, rules]
script:
- mkdir -p security-results/zap
- >
zap-baseline.py
-c .zap.yml
-m 5
-t "https://stage.example.com/api/"
-J security-results/zap/gl-dast-report.json
-r security-results/zap/zap-report.html
-l WARN
artifacts:
when: always
reports:
dast: security-results/zap/gl-dast-report.json
allow_failure: true
Stage-only is intentional — you need a deployed target, and you don't want ZAP hammering production.
Pattern 7: Container Image Builds with Kaniko
If your pipeline builds Docker images, you've probably fought with Docker-in-Docker (DinD). It requires privileged mode on the runner, it's slow (starts a Docker daemon every job), and it's a security surface you don't need.
Kaniko builds container images without a Docker daemon. It runs as a regular container — no privileges, no DinD service, no socket mounting.
Build and push as separate jobs
I split the container pipeline into three stages: build (with --no-push), scan, then push. This keeps scanning in the critical path without requiring registry access:
build:container-service:
stage: build
image:
name: gcr.io/kaniko-project/executor:debug
entrypoint: [""]
script:
- /kaniko/executor
--context "${APP_PATH}"
--dockerfile "${APP_PATH}/Dockerfile"
--destination "${ECR_URI}:latest"
--destination "${ECR_URI}:${CI_COMMIT_SHORT_SHA}"
--tar-path "${CI_PROJECT_DIR}/service-image.tar"
--no-push
artifacts:
paths:
- service-image.tar
expire_in: 1 day
The --no-push flag builds the image and saves it as a tarball artifact. Trivy scans the tarball in the security stage (Pattern 6). Only on commit pipelines — after tests, scans, and builds all pass — does the image get pushed. One caveat: Kaniko can't push a previously saved tarball, so the push job re-runs the build; enable Kaniko's `--cache=true` flag (or push the tarball with a tool like `crane`) if the duplicate build cost matters:
push:container-service:
stage: deploy
image:
name: gcr.io/kaniko-project/executor:debug
entrypoint: [""]
rules:
- !reference [.rules:all:commit, rules]
needs:
- ecr-login
- build:container-service
script:
- /kaniko/executor
--context "${APP_PATH}"
--dockerfile "${APP_PATH}/Dockerfile"
--destination "${ECR_URI}:latest"
--destination "${ECR_URI}:${CI_COMMIT_SHORT_SHA}"
ECR authentication in .pre
Registry login runs once as a .pre job and passes the token as a short-lived artifact:
ecr-login:
stage: .pre
extends: [.aws_credentials]
script:
- aws ecr get-login-password --region ${AWS_DEFAULT_REGION} > ecr-token.txt
artifacts:
paths: [ecr-token.txt]
expire_in: 60 minutes
Downstream Kaniko jobs read this token and write their own /kaniko/.docker/config.json. The 60-minute expiry means the token is never sitting around longer than one pipeline run.
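The config-writing step can be sketched like this (the function and file names are mine, not from the pipeline; Kaniko honors the `DOCKER_CONFIG` environment variable, which also makes the sketch easy to exercise outside a runner):

```shell
#!/bin/sh
# Turn an ECR login token into the Docker config Kaniko expects.
write_docker_config() {
  registry="$1" token="$2"
  cfg_dir="${DOCKER_CONFIG:-/kaniko/.docker}"   # Kaniko's default config dir
  mkdir -p "$cfg_dir"
  # ECR uses HTTP basic auth: user "AWS", the login token as password
  auth=$(printf 'AWS:%s' "$token" | base64 | tr -d '\n')
  printf '{"auths":{"https://%s":{"auth":"%s"}}}\n' \
    "$registry" "$auth" > "$cfg_dir/config.json"
}

# In the real job: write_docker_config "${ECR_URI%%/*}" "$(cat ecr-token.txt)"
# Demo with a dummy registry and token:
DOCKER_CONFIG="$(mktemp -d)"
write_docker_config "123456789012.dkr.ecr.us-east-1.amazonaws.com" "dummy-token"
cat "$DOCKER_CONFIG/config.json"
```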
Dual tagging
Every push tags with both latest and $CI_COMMIT_SHORT_SHA. latest is convenient for dev workflows. The SHA tag gives you immutable, auditable references — you can always trace exactly which commit is running in each environment.
Pattern 8: Automated Waterfall Promotion
After a successful deployment to dev, I automatically create an MR to promote to stage. After stage succeeds, same thing for prod. This creates a consistent, auditable promotion path without manual intervention.
mr_dev_to_stage:
stage: .post
image: registry.gitlab.com/gitlab-org/cli:latest
rules:
- if: '$CI_COMMIT_REF_NAME == "dev" && $CI_PIPELINE_SOURCE == "push"'
allow_failure: true
script:
- |
glab mr create \
--source-branch dev \
--target-branch stage \
--title "Promote Dev to Stage" \
--description "Automatic promotion from pipeline $CI_PIPELINE_ID." \
--yes --remove-source-branch=false
auto_merge_dev_to_stage:
stage: .post
image: registry.gitlab.com/gitlab-org/cli:latest
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "stage"'
when: on_success
script:
- glab mr merge ${CI_MERGE_REQUEST_IID} --yes --squash=false --remove-source-branch=false
The flow: code merges to dev → pipeline runs → .post stage creates MR (dev→stage) → stage MR pipeline runs all validations → on success, auto-merges → stage pipeline runs → creates MR (stage→prod) → same pattern.
allow_failure: true is important here — the MR creation will fail if one already exists, and that's fine.
Pattern 9: Smart Defaults
Set sensible defaults at the pipeline level so individual jobs stay clean:
default:
image: node:20
artifacts:
expire_in: 1 day
interruptible: true
retry:
max: 1
when:
- runner_system_failure
- stuck_or_timeout_failure
Key decisions:
- `interruptible: true` by default — new commits cancel stale MR pipelines (with `workflow: auto_cancel`). Override to `false` for deploy jobs.
- `retry` on infrastructure failures — flaky runners shouldn't block your pipeline. But only retry on system failures, not script failures (that's a real bug).
- Short artifact expiry — 1 day for build artifacts, with deploy jobs overriding to 30 days when needed.
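The `workflow: auto_cancel` pairing looks like this (GitLab 16.8+ syntax; a sketch of the top-level config):

```yaml
workflow:
  auto_cancel:
    on_new_commit: interruptible   # cancel only jobs marked interruptible
```

With this in place, a new push to an MR branch cancels the stale pipeline's test and build jobs but leaves any in-flight deploy (marked `interruptible: false`) alone.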
Pattern 10: Failure Notifications and Status Dashboards
Pipeline failures should be impossible to miss. I send Adaptive Card payloads to Teams with the specific failed job name and a direct link:
notify_failure:
stage: notify
image: alpine:latest
when: on_failure
rules:
- if: '$CI_COMMIT_BRANCH == "dev" || $CI_COMMIT_BRANCH == "stage" || $CI_COMMIT_BRANCH == "prod"'
script:
- apk add --no-cache curl jq
- |
FAILED_JOBS=$(curl -s \
--header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
"${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/pipelines/${CI_PIPELINE_ID}/jobs?scope[]=failed")
FAILED_JOB_NAME=$(echo "$FAILED_JOBS" | jq -r '.[0].name // "Unknown"')
- |
# Build and send Adaptive Card payload
curl -H "Content-Type: application/json" \
-d "{\"text\": \"Pipeline failed in ${CI_PROJECT_NAME} (${CI_COMMIT_REF_NAME}): ${FAILED_JOB_NAME}\"}" \
"$WEBHOOK_URL"
Only trigger notifications on deployment branches — nobody needs a Teams ping for a failing MR pipeline that's still in progress.
Environment status dashboard
Beyond failure alerts, I also send a status dashboard card that shows the health of all three environments at a glance. The job queries the GitLab API for the latest pipeline status on each deployment branch and renders a compact Adaptive Card:
notify_status:
stage: notify
image: alpine:latest
rules:
- if: '$CI_COMMIT_BRANCH == "dev" || $CI_COMMIT_BRANCH == "stage" || $CI_COMMIT_BRANCH == "prod"'
script:
- apk add --no-cache curl jq
- |
get_pipeline_info() {
PIPELINE_JSON=$(curl -s --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
"${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/pipelines?ref=$1&per_page=1")
STATUS=$(echo "$PIPELINE_JSON" | jq -r '.[0].status // "unknown"')
URL=$(echo "$PIPELINE_JSON" | jq -r '.[0].web_url // ""')
echo "$STATUS|$URL"
}
DEV_INFO=$(get_pipeline_info "dev")
STAGE_INFO=$(get_pipeline_info "stage")
PROD_INFO=$(get_pipeline_info "prod")
- |
# Build Adaptive Card with dev/stage/prod status rows
# Each row shows: environment name, status icon, links to site + pipeline
This runs on both success and failure (controlled by rules). The team gets a single card showing whether dev, stage, and prod are all green — useful after promotions cascade through the pipeline.
Pattern 11: Ephemeral Sandbox Environments
For complex features, I spin up a complete sandbox environment on MR pipelines. One click deploys infrastructure, backend, and frontend to an isolated environment:
deploy:sandbox:
stage: deploy
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "dev"'
when: manual
environment:
name: sandbox
url: https://sandbox.dev.example.com
on_stop: destroy:sandbox
destroy:sandbox:
stage: deploy
when: manual
environment:
name: sandbox
action: stop
script:
- terragrunt run --all destroy --non-interactive
when: manual is critical — you don't want every MR automatically provisioning cloud infrastructure. Engineers opt-in when they need it. The on_stop linkage ensures GitLab shows a "Stop" button to tear it down.
Pattern 12: Pipeline Debug Job
When a pipeline behaves unexpectedly — wrong environment, missing variables, rules not matching — you need visibility into what GitLab actually resolved at runtime. I keep a lightweight debug job in .pre that dumps the pipeline's state:
debug:
extends: [.rules:all:mr:commit]
stage: .pre
variables:
DEBUG_VARIABLES: >
ENVIRONMENT
AWS_ACCOUNT
AWS_DEFAULT_REGION
APP_DIR
INFRA_DIR
DRY_RUN
script: |
echo "=== Custom Variables ==="
for var in ${DEBUG_VARIABLES}; do
echo "$var = ${!var}"
done
echo "=== CI Variables ==="
env | grep -E '^CI_' | sort
cache: []
interruptible: true
This runs on every pipeline. It costs under 5 seconds and uses no cache. When something goes wrong, the debug log is already there — no need to add a debug job after the fact and re-run.
The DEBUG_VARIABLES list is a curated set of the variables your pipeline actually uses. When you add a new variable to shared config, add it here too. The CI_* dump catches everything GitLab sets automatically — commit info, MR metadata, runner tags, feature flags — which is invaluable when rules aren't behaving as expected.
Putting It All Together
The complete pipeline runs about 20 jobs across 11 stages. On a dev MR where only frontend files changed, change detection skips backend, container, infrastructure, and most security jobs — the pipeline finishes in 3-4 minutes instead of 20.
Here's what the workflow looks like end to end:
- Engineer opens MR targeting `dev`
- Debug job dumps variable state in `.pre`
- Security scans run in parallel: SAST, dependency audit, IaC scan
- Tests run for changed domains only (change detection)
- Build produces frontend bundle, Lambda zip, container tarball
- Container image gets scanned via Trivy
- Infrastructure plan shows what Terraform changes the code triggers
- MR gets reviewed and merged
- Commit pipeline runs everything, deploys to dev
- `.post` stage auto-creates MR to `stage`
- Stage MR pipeline validates, auto-merges on success
- Stage commit pipeline deploys, then DAST scans the live stage API
- Stage pipeline creates MR to `prod`
- Prod MR is reviewed manually, merged, deployed
- If anything fails, Teams gets notified with the exact failed job
- Status dashboard shows all three environments at a glance
Key Takeaways
Modularize early. Split by domain (frontend, backend, infra, security) not by stage. Each domain owns its full lifecycle.
Anchor everything. If you're copying YAML between jobs, you're doing it wrong. Use anchors (&/*) within files and !reference across files.
Make MR pipelines fast, make commit pipelines thorough. Change detection on MRs, full runs on deployment branches.
Layer your security. Five scans across two stages: static analysis before deploy, dynamic analysis after. Advisory mode first, blocking after triage. Scan once on dev, promote with confidence.
Build, scan, then push. Kaniko's --no-push flag lets you scan container images before they touch a registry. The build-scan-push pipeline catches vulnerabilities before they're deployed.
Automate the boring stuff. Promotion MRs, failure notifications, reviewer assignment, environment dashboards — pipeline automation shouldn't stop at deploy.
Make pipelines debuggable. A 5-second debug job in .pre saves hours of troubleshooting when variables don't propagate or rules don't match.
Design for the person after you. Clear stage names, well-organized includes, and a README in .gitlab/ means the next engineer isn't reverse-engineering your YAML at 2am.