AWS Big Data Blog

Build data pipelines with dbt in Amazon Redshift using Amazon MWAA and Cosmos

Effective collaboration and scalability are essential for building efficient data pipelines. However, data modeling teams often face challenges with complex extract, transform, and load (ETL) tools, requiring programming expertise and a deep understanding of infrastructure. This complexity can lead to operational inefficiencies and challenges in maintaining data quality at scale.

dbt addresses these challenges by providing a simpler approach where data teams can build robust data models using SQL, a language they’re already familiar with. When integrated with modern development practices, dbt projects can use version control for collaboration, incorporate testing for data quality, and utilize reusable components through macros. dbt also automatically manages dependencies, making sure data transformations execute in the correct sequence.

In this post, we explore a streamlined, configuration-driven approach to orchestrate dbt Core jobs using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Cosmos, an open source package. These jobs run transformations on Amazon Redshift, a fully managed data warehouse that enables fast, scalable analytics using standard SQL. With this setup, teams can collaborate effectively while maintaining data quality, operational efficiency, and observability. Key steps covered include:

  • Creating a sample dbt project
  • Enabling auditing within the dbt project to capture runtime metrics for each model
  • Creating a GitHub Actions workflow to automate deployments
  • Setting up Amazon Simple Notification Service (Amazon SNS) to proactively alert on failures

These enhancements enable model-level auditing, automated deployments, and real-time failure alerts. By the end of this post, you will have a practical and scalable framework for running dbt Core jobs with Cosmos on Amazon MWAA, so your team can ship reliable data workflows faster.

Solution overview

The following diagram illustrates the solution architecture.

The workflow contains the following steps:

  1. Analytics engineers manage their dbt project in their version control tool. In this post, we use GitHub as an example.
  2. We configure an Apache Airflow Directed Acyclic Graph (DAG) to use the Cosmos library to create an Airflow task group that contains all the dbt models as part of the dbt project.
  3. We use a GitHub Actions workflow to sync the dbt project files and the DAG to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. During the DAG run, dbt converts the models, tests, and macros to Amazon Redshift SQL statements, which run directly on the Redshift cluster.
  5. If a task in the DAG fails, the DAG invokes an AWS Lambda function to send out a notification using Amazon SNS.

Prerequisites

You must have the following prerequisites:

  • An AWS account with permissions to create and manage the resources used in this post
  • An Amazon MWAA environment, with an S3 bucket configured as its DAG source
  • An Amazon Redshift cluster or Amazon Redshift Serverless workgroup
  • A GitHub repository and an AWS Identity and Access Management (IAM) role that GitHub Actions can assume (only required if you use the optional deployment workflow later in this post)

Create a dbt project

A dbt project is structured to facilitate modular, scalable, and maintainable data transformations. The following is the sample dbt project structure that this post follows:

MY_SAMPLE_DBT_PROJECT
├── .github
│   └── workflows
│       └── publish_assets.yml
└── src
    ├── dags
    │   └── dbt_sample_dag.py
    └── my_sample_dbt_project
        ├── macros
        ├── models
        └── dbt_project.yml

dbt uses the following YAML files:

  • dbt_project.yml –  Serves as the main configuration for your project. Objects in this project will inherit settings defined here unless overridden at the model level. For example:
# Name your project! Project names should contain only lowercase characters
# and underscores. 
name: 'my_sample_dbt_project'
version: '1.0.0'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. 
model-paths: ["models"]
macro-paths: ["macros"]

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  my_sample_dbt_project:
    # Config indicated by + and applies to files under models/example/
    example:
      +materialized: view
      
on-run-end:
# add run results to audit table 
  - "{{ log_audit_table(results) }}" 
  • sources.yml – Defines the external data sources that your dbt models will reference. For example:
sources:
  - name: sample_source
    database: sample_database
    schema: sample_schema
    tables:
      - name: sample_table
  • schema.yml – Outlines the schema of your models and data quality tests. In the following example, we define two columns: full_name for the model model1 and sales_id for model2. We declare each as a primary key and define data quality tests to check that the two columns are unique and not null. A sample model file that builds on these definitions appears after this list.
version: 2

models:
  - name: model1
    config: 
      contract: {enforced: true}

    columns:
      - name: full_name
        data_type: varchar(100)
        constraints:
          - type: primary_key
        tests:
          - unique
          - not_null

  - name: model2
    config: 
      contract: {enforced: true}

    columns:
      - name: sales_id
        data_type: varchar(100)
        constraints:
          - type: primary_key
        tests:
          - unique
          - not_null
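
The model definitions in schema.yml correspond to .sql files in the models directory, which this post doesn't show in full. The following is a minimal sketch of what models/model1.sql could look like; customer_name is a hypothetical column in the source table defined in sources.yml, and the cast matches the varchar(100) data type declared in the enforced contract (model2.sql would follow the same pattern for sales_id):

-- models/model1.sql (illustrative only; replace customer_name with a real source column)
{{
    config(
        materialized='table',
        tags=["operations"]
    )
}}

select
    cast(customer_name as varchar(100)) as full_name
from {{ source('sample_source', 'sample_table') }}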

Enable auditing within dbt project

Enabling auditing within your dbt project is crucial for facilitating transparency, traceability, and operational oversight across your data pipeline. You can capture run metrics at the model level for each execution in an audit table. By capturing detailed run metrics such as load identifier, runtime, and number of rows affected, teams can systematically monitor the health and performance of each load, quickly identify issues, and trace changes back to specific runs.

The audit table consists of the following attributes:

  • load_id – An identifier for each model run executed as part of the load
  • invocation_id – The unique identifier of the dbt invocation, shared by all models in the same run
  • database_name – The name of the database within which data is being loaded
  • schema_name – The name of the schema within which data is being loaded
  • name – The name of the object within which data is being loaded
  • resource_type – The type of object to which data is being loaded
  • status – The completion status of each dbt model run
  • execution_time – The time duration taken for each dbt model to complete execution as part of each load
  • rows_affected – The number of rows affected in the dbt model as part of the load
  • model_execution_date – The date on which the load was executed
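
After a few loads have run, you can inspect these metrics directly in Amazon Redshift. The following is a minimal example query, assuming the audit_table model is materialized in the sample_schema target schema used later in this post:

-- Review model-level run metrics for today's loads
select
    load_id,
    name,
    status,
    execution_time,
    rows_affected
from sample_schema.audit_table
where model_execution_date = current_date
order by execution_time desc;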

Complete the following steps to enable auditing within your dbt project:

  1. Navigate to the models directory (src/my_sample_dbt_project/models) and create the audit_table.sql model file:
{%- set run_date = "CURRENT_DATE" -%}
{{
    config(
        materialized='incremental',
        incremental_strategy='append',
        tags=["audit"]
    )
}}

with empty_table as (
    select
        'test_load_id'::varchar(200) as load_id,
        'test_invocation_id'::varchar(200) as invocation_id,
        'test_database_name'::varchar(200) as database_name,
        'test_schema_name'::varchar(200) as schema_name,
        'test_model_name'::varchar(200) as name,
        'test_resource_type'::varchar(200) as resource_type,
        'test_status'::varchar(200) as status,
        cast('12122012' as float) as execution_time,
        cast('100' as int) as rows_affected,
        {{run_date}} as model_execution_date
)

select * from empty_table
-- This is a filter so we will never actually insert these values
where 1 = 0
  2. Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the parse_dbt_results.sql macro file:
{% macro parse_dbt_results(results) %}
    -- Create a list of parsed results
    {%- set parsed_results = [] %}
    -- Flatten results and add to list
    {% for run_result in results %}
        -- Convert the run result object to a simple dictionary
        {% set run_result_dict = run_result.to_dict() %}
        -- Get the underlying dbt graph node that was executed
        {% set node = run_result_dict.get('node') %}
        {% set rows_affected = run_result_dict.get(
        'adapter_response', {}).get('rows_affected', 0) %}
        {%- if not rows_affected -%}
            {% set rows_affected = 0 %}
        {%- endif -%}
        {% set parsed_result_dict = {
                'load_id': invocation_id ~ '.' ~ node.get('unique_id'),
                'invocation_id': invocation_id,
                'database_name': node.get('database'),
                'schema_name': node.get('schema'),
                'name': node.get('name'),
                'resource_type': node.get('resource_type'),
                'status': run_result_dict.get('status'),
                'execution_time': run_result_dict.get('execution_time'),
                'rows_affected': rows_affected
                }%}
        {% do parsed_results.append(parsed_result_dict) %}
    {% endfor %}
    {{ return(parsed_results) }}
{% endmacro %}
  3. Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the log_audit_table.sql macro file:
{% macro log_audit_table(results) %}
    -- depends_on: {{ ref('audit_table') }}
    {%- if execute -%}
        {{ print("Running log_audit_table Macro") }}
        {%- set run_date = "CURRENT_DATE" -%}
        {%- set parsed_results = parse_dbt_results(results) -%}
        {%- if parsed_results | length  > 0 -%}
            {% set allowed_columns = ['load_id', 'invocation_id', 'database_name', 
            'schema_name', 'name', 'resource_type', 'status', 'execution_time', 
            'rows_affected', 'model_execution_date'] -%}
            {% set insert_dbt_results_query -%}
                insert into {{ ref('audit_table') }}
                    (
                        load_id,
                        invocation_id,
                        database_name,
                        schema_name,
                        name,
                        resource_type,
                        status,
                        execution_time,
                        rows_affected,
                        model_execution_date
                ) values
                    {%- for parsed_result_dict in parsed_results -%}
                        (
                            {%- for column, value in parsed_result_dict.items() %}
                                {% if column not in allowed_columns %}
                                    {{ exceptions.raise_compiler_error("Invalid
                                     column") }}
                                {% endif %}
                                {% set sanitized_value = value | replace("'", "''") %}
                                '{{ sanitized_value }}'
                                {%- if not loop.last %}, {% endif %}
                            {%- endfor -%}
                        )
                        {%- if not loop.last %}, {% endif %}
                    {%- endfor -%}
            {%- endset -%}
            {%- do run_query(insert_dbt_results_query) -%}
        {%- endif -%}
    {%- endif -%}
    {{ return ('') }}
{% endmacro %}
  4. Append the following lines to the dbt_project.yml file:
on-run-end:
  - "{{ log_audit_table(results) }}" 

Create a GitHub Actions workflow

This step is optional. If you prefer, you can skip it and instead upload your files directly to your S3 bucket.

The following GitHub Actions workflow automates the deployment of the dbt project files and the DAG file to Amazon S3. Replace the placeholders {s3_bucket_name}, {account_id}, {role_name}, and {region} in the workflow file with your S3 bucket name, account ID, IAM role name, and AWS Region.

To enhance security, it’s recommended to use OpenID Connect (OIDC) for authentication with IAM roles in GitHub Actions instead of relying on long-lived access keys.

name: Sync dbt Project with S3

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    paths:
      - "src/**"

permissions:
  id-token: write   # This is required for requesting the JWT
  contents: read    # This is required for actions/checkout
  pull-requests: write

jobs:
  sync-dev:
    runs-on: ubuntu-latest
    environment: dev
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Assume AWS IAM Role
        uses: aws-actions/configure-aws-credentials@v4.0.2
        with:
          aws-region: {region}
          role-to-assume: arn:aws:iam::{account_id}:role/{role_name}
          role-session-name: my_sample_dbt_project_${{ github.run_id }}
          role-duration-seconds: 3600 # 1 hour

      - run: aws sts get-caller-identity

      - name: Sync dbt Model files
        id: dbt_project_files
        working-directory: src/my_sample_dbt_project
        run: aws s3 sync . s3://{s3_bucket_name}/dags/dbt/my_sample_dbt_project --delete
        continue-on-error: false

      - name: Sync DAG files
        id: dag_file
        working-directory: src/dags
        run: aws s3 sync . s3://{s3_bucket_name}/dags

Make sure the following GitHub security measures are in place:

  • Branch protection rules – Before proceeding with the GitHub Actions workflow, make sure branch protection rules are in place. These rules enforce required status checks before merging code into protected branches (such as main).
  • Code review guidelines – Implement code review processes to make sure changes undergo review. This can include requiring at least one approving review before code is merged into the protected branch.
  • Incorporate security scanning tools – This can help detect vulnerabilities in your repository.

Make sure you are also adhering to dbt-specific security best practices:

  • Pay attention to dbt macros with variables and validate their inputs.
  • When adding new packages to your dbt project, evaluate their security, compatibility, and maintenance status to make sure they don’t introduce vulnerabilities or conflicts into your project.
  • Review dynamically generated SQL to safeguard against issues like SQL injection.

Update the Amazon MWAA instance

Complete the following steps to update the Amazon MWAA instance:

  1. Install the Cosmos library on Amazon MWAA by adding astronomer-cosmos to the requirements.txt file. Make sure to check version compatibility between your Amazon MWAA (Apache Airflow) version and the Cosmos library.
  2. Add the following entries in your startup.sh script:
    1. In the following code, DBT_VENV_PATH specifies the location where the Python virtual environment for dbt will be created. DBT_PROJECT_PATH points to the location of your dbt project inside Amazon MWAA.
      #!/bin/sh
      export DBT_VENV_PATH="${AIRFLOW_HOME}/dbt_venv"
      export DBT_PROJECT_PATH="${AIRFLOW_HOME}/dags/dbt"
    2. The following code creates a Python virtual environment at the path ${DBT_VENV_PATH} and installs the dbt-redshift adapter to run dbt transformations on Amazon Redshift:
      python3 -m venv "${DBT_VENV_PATH}"
      ${DBT_VENV_PATH}/bin/pip install dbt-redshift

Create a dbt user in Amazon Redshift and store credentials

To create dbt models in Amazon Redshift, you must set up a native Redshift user with the necessary permissions to access source tables and create new tables. It is essential to create separate database users with minimal permissions to follow the principle of least privilege. The dbt user should not be granted admin privileges; instead, it should only have access to the specific schemas required for its tasks.

Complete the following steps:

  1. Open the Amazon Redshift console and connect as an admin (for more details, refer to Connecting to an Amazon Redshift database).
  2. Run the following command in the query editor v2 to create a native user, and note down the values for dbt_user_name and password_value:
    create user {dbt_user_name} password 'sha256|{password_value}';
  3. Run the following commands in the query editor v2 to grant permissions to the native user:
    1. Connect to the database where you want to source tables from and run the following commands:
      grant usage on schema {schema_name} to {dbt_user_name};
      grant select on all tables in schema {schema_name} to {dbt_user_name};
    2. To allow the user to create tables within a schema, run the following command:
      grant create on schema {schema_name} to {dbt_user_name};
  4. Optionally, create a secret in AWS Secrets Manager and store the values for dbt_user_name and password_value from the previous step as plaintext:
{
    "username":"dbt_user_name",
    "password":"password_value"
}

Creating a Secrets Manager entry is optional, but recommended for securely storing your credentials instead of hardcoding them. To learn more, refer to AWS Secrets Manager best practices.
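
Optionally, before moving on, you can confirm that the grants from the previous steps took effect. The following is a quick check that uses the Amazon Redshift HAS_SCHEMA_PRIVILEGE function, run as the admin user with the same placeholders as above:

-- Both statements should return true for the dbt user
select has_schema_privilege('{dbt_user_name}', '{schema_name}', 'usage');
select has_schema_privilege('{dbt_user_name}', '{schema_name}', 'create');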

Create a Redshift connection in Amazon MWAA

We create one Redshift connection in Amazon MWAA for each Redshift database, making sure that each data pipeline (DAG) can only access one database. This approach provides distinct access controls for each pipeline, helping prevent unauthorized access to data. Complete the following steps:

  1. Log in to the Amazon MWAA UI.
  2. On the Admin menu, choose Connections.
  3. Choose Add a new record.
  4. For Connection Id, enter a name for this connection.
  5. For Connection Type, choose Amazon Redshift.
  6. For Host, enter the endpoint of the Redshift cluster without the port and database name (for example, redshift-cluster-1.xxxxxx.us-east-1.redshift.amazonaws.com).
  7. For Database, enter the database of the Redshift cluster.
  8. For Port, enter the port of the Redshift cluster.

Set up an SNS notification

Setting up SNS notifications is optional, but they can be a useful enhancement to receive alerts on failures. Complete the following steps:

  1. Create an SNS topic.
  2. Create a subscription to the SNS topic.
  3. Create a Lambda function with the Python runtime.
  4. Modify the function code in your Lambda function, and replace {topic_arn} with your SNS topic Amazon Resource Name (ARN):
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

sns_client = boto3.client('sns')

def lambda_handler(event, context):
    try:
        # Extract DAG name from event
        failed_dag = event['dag_name']

        # Send notification
        sns_client.publish(
            TopicArn="{topic_arn}",  # replace with your SNS topic ARN
            Subject="Data modelling dags - WARNING",
            Message=json.dumps({'default': json.dumps(
                f"Data modelling DAG - {failed_dag} has failed, "
                "please inform the data modelling team")}),
            MessageStructure='json'
        )

    except KeyError:
        # Handle missing 'dag_name' in the event
        logger.error("KeyError: invalid payload - dag_name not present")

Configure a DAG

The following sample DAG orchestrates a dbt workflow for processing and auditing data models in Amazon Redshift. It retrieves credentials from Secrets Manager, runs dbt tasks in a virtual environment, and sends an SNS notification if a failure occurs. The workflow consists of the following steps:

  1. It starts with the audit_dbt_task task group, which creates the audit model.
  2. The transform_data task group executes the other dbt models, excluding the audit-tagged one. Inside the transform_data group, there are two dbt models, model1 and model2, and each is followed by a corresponding test task that runs data quality tests defined in the schema.yml file.
  3. To properly detect and handle failures, the DAG includes a dbt_check Python task that runs a custom function, check_dbt_failures. This is important because when using DbtTaskGroup, individual model-level failures inside the group don’t automatically propagate to the task group level. As a result, downstream tasks (such as the Lambda operator sns_notification_for_failure) configured with trigger_rule='one_failed' will not be triggered unless a failure is explicitly raised.

The check_dbt_failures function addresses this by inspecting the results of each dbt model and test, and raising an AirflowException if a failure is found. When an AirflowException is raised, the sns_notification_for_failure task is triggered.

  4. If a failure occurs, the sns_notification_for_failure task invokes a Lambda function to send an SNS notification. If no failures are detected, this task is skipped.

The following diagram illustrates this workflow.

Configure DAG variables

To customize this DAG for your environment, configure the following variables:

  • project_name – Make sure the project_name matches the S3 prefix of your dbt project
  • secret_name – Provide the name of the secret that stores dbt user credentials
  • target_database and target_schema – Update these variables to reflect where you want to land your dbt models in Amazon Redshift
  • redshift_connection_id – Set this to match the connection configured in Amazon MWAA for this Redshift database
  • sns_lambda_function_name – Provide the Lambda function name to send SNS notifications
  • dag_name – Provide the DAG name that will be passed to the SNS notification Lambda function
import os
import json
import boto3
from airflow import DAG
from cosmos import (
    DbtTaskGroup, ProfileConfig, ProjectConfig,
    ExecutionConfig, RenderConfig
)
from cosmos.constants import ExecutionMode, LoadMode
from cosmos.profiles import RedshiftUserPasswordProfileMapping
from pendulum import datetime
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator
)
from airflow.exceptions import AirflowException

# project name - should match the s3 prefix of your dbt project
project_name = "my_sample_dbt_project"
# name of the secret that stores dbt user credentials 
secret_name = "dbt_user_credentials_secret"
# target database to land dbt models
target_database = "sample_database"
# target schema to land dbt models
target_schema = "sample_schema"
# Redshift connection name from MWAA
redshift_connection_id = "my_sample_dbt_project_connection"
# sns lambda function name
sns_lambda_function_name = "sns_notification"
# dag name - this will be passed to SNS for notification
payload = json.dumps({
            "dag_name": "my_sample_dbt_project_dag"
        })

Incorporate DAG components

After setting the variables, you can now incorporate the following components to complete the DAG.

Secrets Manager

The DAG retrieves dbt user credentials from Secrets Manager:

sm_client = boto3.client('secretsmanager')

def get_secret(secret_name):
    try:
        get_secret_value_response = sm_client.get_secret_value(SecretId=secret_name)
        return json.loads(get_secret_value_response["SecretString"])
    except Exception as e:
        raise

secret_value = get_secret(secret_name)
username = secret_value["username"]
password = secret_value["password"]

Redshift connection configuration

It uses RedshiftUserPasswordProfileMapping to authenticate:

profile_config = ProfileConfig(
    profile_name="redshift",
    target_name=target_database,
    profile_mapping=RedshiftUserPasswordProfileMapping(
        conn_id=redshift_connection_id,
        profile_args={"schema": target_schema,
                      "user": username, "password": password}
    ),
)

dbt execution setup

This code contains the following configuration:

  • dbt executable path – Points to the dbt executable inside the virtual environment created by the startup script
  • dbt project path – Is built from the DBT_PROJECT_PATH environment variable and your project name
execution_config = ExecutionConfig(
    dbt_executable_path=f"{os.environ['DBT_VENV_PATH']}/bin/dbt",
    execution_mode=ExecutionMode.VIRTUALENV,
)

project_config = ProjectConfig(
    dbt_project_path=f"{os.environ['DBT_PROJECT_PATH']}/{project_name}",
)

Tasks and execution flow

This step includes the following components:

  • Audit dbt task group (audit_dbt_task) – Runs the dbt model tagged with audit
  • dbt task group (transform_data) – Runs the dbt models tagged with operations, excluding the audit model

In dbt, tags are labels that you can assign to models, tests, seeds, and other dbt resources to organize and selectively run subsets of your dbt project. In your render_config, you have exclude=["tag:audit"]. This means dbt will exclude models that have the tag audit, because the audit model runs separately.

  • Failure check (dbt_check) – Checks for dbt model failures, raises an AirflowException if upstream dbt tasks fail
  • SNS notification on failure (sns_notification_for_failure) – Invokes a Lambda function to send an SNS notification upon a dbt task failure (for example, a dbt model in the task group)
def check_dbt_failures(**kwargs):
    if kwargs['ti'].state == 'failed':
        raise AirflowException('Failure in dbt task group')

with DAG(
    dag_id="my_sample_dbt_project_dag",
    start_date=datetime(2025, 4, 2),
    schedule_interval="@daily",
    catchup=False,
    tags=["dbt"]
):

    audit_dbt_task = DbtTaskGroup(
        group_id="audit_dbt_task",
        execution_config=execution_config,
        profile_config=profile_config,
        project_config=project_config,
        operator_args={
            "install_deps": True,
        },
        render_config= RenderConfig(
            select=["tag:audit"],
            load_method=LoadMode.DBT_LS
        )
    )

    transform_data = DbtTaskGroup(
        group_id="transform_data",
        execution_config=execution_config,
        profile_config=profile_config,
        project_config=project_config,
        operator_args={
            "install_deps": True,
            # install necessary dependencies before running dbt command
        },
        render_config= RenderConfig(
            exclude=["tag:audit"],
            load_method=LoadMode.DBT_LS
        )
    )

    dbt_check = PythonOperator(
        task_id='dbt_check', 
        python_callable=check_dbt_failures,
        provide_context=True,
    )

    sns_notification_for_failure = LambdaInvokeFunctionOperator(
        task_id="sns_notification_for_failure",
        function_name=sns_lambda_function_name,
        payload=payload,
        trigger_rule='one_failed'
    )

    audit_dbt_task >> transform_data >> dbt_check >> sns_notification_for_failure

The sample DAG orchestrates a dbt workflow in Amazon Redshift, starting with an audit task group, followed by a task group that processes the data models. It includes a failure handling mechanism that checks for failures and raises an exception to trigger an SNS notification through Lambda if a failure occurs. If no failures are detected, the SNS notification task is skipped.

Clean up

If you no longer need the resources you created, delete them to avoid additional charges. This includes the following:

  • Amazon MWAA environment
  • S3 bucket
  • IAM role
  • Redshift cluster or serverless workgroup
  • Secrets Manager secret
  • SNS topic
  • Lambda function

Conclusion

By integrating dbt with Amazon Redshift and orchestrating workflows using Amazon MWAA and the Cosmos library, you can simplify data transformation workflows while maintaining robust engineering practices. The sample dbt project structure, combined with automated deployments through GitHub Actions and proactive monitoring using Amazon SNS, provides a foundation for building reliable data pipelines. The addition of audit logging facilitates transparency across your transformations, so teams can maintain high data quality standards.

You can use this solution as a starting point for your own dbt implementation on Amazon MWAA. The approach we outlined emphasizes SQL-based transformations while incorporating essential operational capabilities like deployment automation and failure alerting. Get started by adapting the configuration to your environment, and build upon these practices as your data needs evolve.

For more resources, refer to Manage data transformations with dbt in Amazon Redshift and Redshift setup.


About the authors

Cindy Li is an Associate Cloud Architect at AWS Professional Services, specializing in Data Analytics. Cindy works with customers to design and implement scalable data analytics solutions on AWS. When Cindy is not diving into tech, you can find her out on walks with her playful toy poodle Mocha.

Akhil B is a Data Analytics Consultant at AWS Professional Services, specializing in cloud-based data solutions. He partners with customers to design and implement scalable data analytics platforms, helping organizations transform their traditional data infrastructure into modern, cloud-based solutions on AWS. His expertise helps organizations optimize their data ecosystems and maximize business value through modern analytics capabilities.

Joao Palma is a Senior Data Architect at Amazon Web Services, where he partners with enterprise customers to design and implement comprehensive data platform solutions. He specializes in helping organizations transform their data into strategic business assets and enabling data-driven decision making.

Harshana Nanayakkara is a Delivery Consultant at AWS Professional Services, where he helps customers tackle complex business challenges using AWS Cloud technology. He specializes in data and analytics, data governance, and AI/ML implementations.