AWS Big Data Blog

Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker

The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, as well as generative AI development. This unified experience is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which delivers an embedded generative AI and governance experience at every step.

With data lineage, now part of SageMaker Catalog, domain administrators and data producers can centralize lineage metadata of their data assets in a single place. You can track the flow of data over time, giving you a clear understanding of where it originated, how it has changed, and its ultimate use across the business. By providing this level of transparency around the origin of data, data lineage helps data consumers gain trust that the data is correct for their use case. Because data lineage is captured at the table, column, and job level, data producers can also conduct impact analysis and respond to data issues when needed.

Capture of data lineage in SageMaker starts after connections and data sources are configured; lineage events are generated when data is transformed in AWS Glue or Amazon Redshift. This capability is also fully compatible with OpenLineage, so you can extend data lineage capture to other data processing tools. This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools that support the OpenLineage standard, like dbt, Apache Airflow, and Apache Spark.

Solution overview

Many third-party and open source tools used today to orchestrate and run data pipelines, like dbt, Airflow, and Spark, actively support the OpenLineage standard to provide interoperability across environments. With this capability, you only need to add and configure the right library in your environment to stream lineage events from jobs running on the tool, either to the job's output logs or to a target HTTP endpoint that you specify.

With the target HTTP endpoint option, you can introduce a pattern to post lineage events from these tools into SageMaker or Amazon DataZone to further help you centralize governance of your data assets and processes in a single place. This pattern takes the form of a proxy, and its simplified architecture is illustrated in the following figure.

The way that the proxy for OpenLineage works is simple:

  • Amazon API Gateway exposes an HTTP endpoint and path. Jobs running with the OpenLineage package on top of the supported data processing tools can be set up with the HTTP transport option pointing to this endpoint and path. If connectivity allows, lineage events will be streamed into this endpoint as the job runs.
  • An Amazon Simple Queue Service (Amazon SQS) queue buffers the events as they arrive. By storing them in a queue, you have the option to implement strategies for retries and errors when needed. For cases where event order is required, we recommend the use of first-in-first-out (FIFO) queues; however, SageMaker and Amazon DataZone are able to map incoming OpenLineage events, even if they are out of order.
  • An AWS Lambda function retrieves events from the queue in batches. For every event in a batch, the function can perform transformations when needed and post the resulting event to the target SageMaker or Amazon DataZone domain (a minimal sketch of such a function follows this list).
  • Even though it’s not shown in the architecture, AWS Identity and Access Management (IAM) and Amazon CloudWatch are key capabilities that allow secure interaction between resources with minimum permissions and logging for troubleshooting and observability.
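
The following is a minimal sketch of such a Lambda function, assuming a Python runtime, an SQS trigger, and a hypothetical DOMAIN_ID environment variable that holds the target domain identifier. It forwards each event through the Amazon DataZone PostLineageEvent API; the sample solution described next contains the complete implementation.

import os

import boto3

datazone = boto3.client("datazone")

# Hypothetical environment variable holding the target domain identifier.
DOMAIN_ID = os.environ["DOMAIN_ID"]

def handler(event, context):
    # Each SQS record body is the raw OpenLineage event JSON posted by the job.
    for record in event.get("Records", []):
        lineage_event = record["body"]
        # Add any tool-specific transformation logic here before forwarding.
        datazone.post_lineage_event(
            domainIdentifier=DOMAIN_ID,
            event=lineage_event.encode("utf-8"),
        )

In this sketch the event is forwarded as-is; the transformation hook is where tool-specific adjustments, discussed later in this post, would go.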

The AWS sample OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone provides a working implementation of this simplified architecture that you can test and customize as needed. To deploy in a test environment, follow the steps as described in the repository. We use an AWS CloudFormation template to deploy solution resources.

After you have deployed the OpenLineage HTTP Proxy solution, you can use it to post lineage events from data processing tools like dbt, Airflow, and Spark into a SageMaker or Amazon DataZone domain, as shown in the following examples.

Set up the OpenLineage package for Spark in AWS Glue 4.0

AWS Glue added built-in support for OpenLineage with AWS Glue 5.0 (to learn more, see Introducing AWS Glue 5.0 for Apache Spark). For jobs that are still running on AWS Glue 4.0, you can still stream OpenLineage events into SageMaker or Amazon DataZone by using the OpenLineage HTTP Proxy solution. This serves as an example that can be applied to other platforms running Spark, such as Amazon EMR, third-party solutions, or self-managed clusters.
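
For instance, on a self-managed Spark cluster or notebook, an equivalent configuration can be set directly on the Spark session. The following is a minimal PySpark sketch under that assumption; the placeholders are the same ones used later in this post, and the namespace value is hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage-example")
    # Pull the OpenLineage listener from Maven (same release used in this post).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "<OPENLINEAGE_PROXY_ENDPOINT_URL>")
    .config("spark.openlineage.transport.endpoint", "/<OPENLINEAGE_PROXY_ENDPOINT_PATH>")
    # Logical namespace that groups lineage events from this cluster.
    .config("spark.openlineage.namespace", "my-spark-cluster")
    .getOrCreate()
)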

To properly add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage events into the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Download the official OpenLineage package for Spark. For our example, we used the JAR package for Scala 2.12, release 1.9.1.
  2. Store the JAR file in an Amazon Simple Storage Service (Amazon S3) bucket that can be accessed by your AWS Glue job.
  3. On the AWS Glue console, open your job.
  4. Under Libraries, for Dependent JARs path, enter the path of the JAR package stored in your S3 bucket.

  5. In the Job parameters section, add the following parameters:
    1. Enable the OpenLineage package:
      1. Key: --user-jars-first
      2. Value: true
    2. Configure how the OpenLineage package will be used to stream lineage events. Replace <OPENLINEAGE_PROXY_ENDPOINT_URL> and <OPENLINEAGE_PROXY_ENDPOINT_PATH> with the corresponding values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack. Replace <ACCOUNT_ID> with your AWS account ID.
      1. Key: --conf
      2. Value:
        spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
        --conf spark.openlineage.transport.type=http 
        --conf spark.openlineage.transport.url=<OPENLINEAGE_PROXY_ENDPOINT_URL>
        --conf spark.openlineage.transport.endpoint=/<OPENLINEAGE_PROXY_ENDPOINT_PATH>
        --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
        --conf spark.glue.accountId=<ACCOUNT_ID>

With this setup, the AWS Glue 4.0 job will use the HTTP transport option of the OpenLineage package to stream lineage events into the OpenLineage proxy, which will post events to the SageMaker or Amazon DataZone domain.

  6. Run the AWS Glue 4.0 job. You can also start the run programmatically with the same parameters, as shown in the following sketch.
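
If you prefer to start the run with the AWS SDK for Python (Boto3) instead of the console, the same parameters can be passed as job run arguments. The following is a minimal sketch under that assumption; the job name and JAR path are hypothetical, and the placeholders are the same ones described in the previous step.

import boto3

glue = boto3.client("glue")

# Hypothetical job name and JAR location; placeholders come from the
# CloudFormation stack outputs of the OpenLineage HTTP Proxy solution.
response = glue.start_job_run(
    JobName="my-glue-4-job",
    Arguments={
        "--user-jars-first": "true",
        "--extra-jars": "s3://amzn-s3-demo-bucket/jars/openlineage-spark_2.12-1.9.1.jar",
        "--conf": (
            "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener "
            "--conf spark.openlineage.transport.type=http "
            "--conf spark.openlineage.transport.url=<OPENLINEAGE_PROXY_ENDPOINT_URL> "
            "--conf spark.openlineage.transport.endpoint=/<OPENLINEAGE_PROXY_ENDPOINT_PATH> "
            "--conf spark.openlineage.facets.custom_environment_variables="
            "[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] "
            "--conf spark.glue.accountId=<ACCOUNT_ID>"
        ),
    },
)
print(response["JobRunId"])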

The job’s resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

The origin path in this example is extensive and maps the resulting dataset down to its origin, in this case, a couple of tables hosted in a relational database and transformed through a data pipeline with two AWS Glue 4.0 (Spark) jobs.

Set up the OpenLineage package for dbt

dbt has rapidly become a popular framework for building data pipelines on top of data processing and data warehouse tools like Amazon Redshift, Amazon EMR, and AWS Glue, as well as other traditional and third-party solutions. This framework supports OpenLineage as a way to standardize the generation of lineage events and integrate with the growing data governance ecosystem.

dbt deployments might vary per environment, which is why we don't dive into the specifics in this post. However, to configure your dbt project to use the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Install the OpenLineage package for dbt. You can learn more in the OpenLineage documentation.
  2. In the root folder of your dbt project, create an openlineage.yml file where you can specify the transport configuration. Replace <OPENLINEAGE_PROXY_ENDPOINT_URL> and <OPENLINEAGE_PROXY_ENDPOINT_PATH> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack.
transport:
  type: http
  url: <OPENLINEAGE_PROXY_ENDPOINT_URL>
  endpoint: <OPENLINEAGE_PROXY_ENDPOINT_PATH>
  timeout: 5

  3. Run your dbt pipeline. As explained in the OpenLineage documentation, instead of running the standard dbt run command, you run the dbt-ol run command. The latter is just a wrapper on top of the standard dbt run command so that lineage events are captured and streamed as configured (see the example following this step).
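
For example (the target and model names here are hypothetical), dbt-ol accepts the same arguments you would normally pass to dbt run:

dbt-ol run --target dev --select customer_orders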

The job’s resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its lineage path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

In this example, the dbt project is running on top of Amazon Redshift, which is a common use case among customers. Amazon Redshift is integrated with SageMaker and Amazon DataZone for automatic lineage capture, but those capabilities weren't used in this example, to illustrate how you can still integrate OpenLineage events from dbt using the pattern implemented in the OpenLineage HTTP Proxy solution.

The dbt pipeline consists of two stages running sequentially, which are illustrated in the origin path as the nodes with the dbt type.

Set up the OpenLineage package for Airflow

Airflow is a well-positioned tool to orchestrate data pipelines at any scale. AWS provides Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed alternative for customers who want to reduce operational overhead and accelerate the development of their data strategy with Airflow in a cost-effective way. Airflow also supports OpenLineage, so you can centralize lineage with tools like SageMaker and Amazon DataZone.

The following steps are specific to Amazon MWAA, but they can be extrapolated to other Airflow deployments:

  1. Install the OpenLineage package for Airflow. You can learn more in the OpenLineage documentation. For Airflow versions 2.7 and later, it's recommended to use the native Airflow OpenLineage provider package (apache-airflow-providers-openlineage), which is what this example uses.
  2. To install the package, add it to the requirements.txt file that you are storing in Amazon S3 and that you are pointing to when provisioning your Amazon MWAA environment. To learn more, refer to Managing Python dependencies in requirements.txt.
  3. When you install the OpenLineage package, or afterward, you can configure it to send lineage events to the OpenLineage proxy:
    1. When you fill out the form to create a new Amazon MWAA environment or edit an existing one, in the Airflow configuration options section, add the following. Replace <OPENLINEAGE_PROXY_ENDPOINT_URL> and <OPENLINEAGE_PROXY_ENDPOINT_PATH> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack:
      1. Configuration option: openlineage.transport
      2. Custom value: {"type": "http", "url": "<OPENLINEAGE_PROXY_ENDPOINT_URL>", "endpoint": "<OPENLINEAGE_PROXY_ENDPOINT_PATH>"}

  4. Run your pipeline.

The Airflow tasks will automatically use the transport configuration to stream lineage events into the OpenLineage proxy as they run. The tasks' resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.
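
For reference, tasks don't need any lineage-specific code for this to work; the provider package instruments supported operators, such as SQLExecuteQueryOperator, automatically. The following minimal DAG is a sketch only, with hypothetical connection, schema, and table names.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="redshift_lineage_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # The OpenLineage provider extracts input and output datasets from the SQL.
    SQLExecuteQueryOperator(
        task_id="build_daily_sales",
        conn_id="redshift_default",
        sql="""
            CREATE TABLE reporting.daily_sales AS
            SELECT order_date, SUM(amount) AS total_amount
            FROM sales.orders
            GROUP BY order_date;
        """,
    )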

When working with Amazon DataZone, you will get the same result.

In this example, the Amazon MWAA Directed Acyclic Graph (DAG) is operating on top of Amazon Redshift, similar to the dbt example before. However, it's still not using the native integration for automatic lineage capture between Amazon Redshift and SageMaker or Amazon DataZone. This way, we can illustrate how you can still integrate OpenLineage events from Airflow using the pattern implemented in the OpenLineage HTTP Proxy solution.

The Airflow DAG consists of a single task that produces the resulting dataset from datasets that were created as part of the dbt pipeline in the previous example. This is illustrated in the origin path, which includes nodes with the dbt type and a node with the AIRFLOW type. With this final example, note how SageMaker and Amazon DataZone map all datasets and jobs to reflect the reality of your data pipelines.

Additional considerations when implementing the OpenLineage proxy pattern

The OpenLineage proxy pattern implemented in the sample OpenLineage HTTP Proxy solution and presented in this post has proven to be a practical approach to integrate a growing set of data processing tools into a centralized data governance strategy on top of SageMaker. We encourage you to dive into it and use it in your test environments to learn how it can best serve your specific setup.

If you're interested in taking this pattern to production, we suggest you first review it thoroughly and customize it to your particular needs. The following are some items worth reviewing as you evaluate this pattern implementation:

  • The solution used in the examples of this post uses a public API endpoint with no authentication or authorization mechanism. For a production workload, we recommend limiting access to the endpoint to a minimum so only authorized resources are able to stream messages into it. To learn more, refer to Control and manage access to HTTP APIs in API Gateway.
  • The logic implemented in the Lambda function is intended to be customized for your use case. You might need to implement transformation logic, depending on how OpenLineage events are created by the tool you are using. As a reference, for the Amazon MWAA example in this post, some minor transformations were required on the name and namespace fields of the inputs and outputs elements of the event for full compatibility with the format expected for Amazon Redshift datasets, as described in the dataset naming conventions of OpenLineage (an illustrative sketch follows this list). You might also need to change how the function logs execution details or add retry and error handling logic.
  • The SQS queue used in the OpenLineage HTTP Proxy solution is a standard queue, which means events aren't guaranteed to be delivered in order. If ordering is a requirement, you could use FIFO queues instead.
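
Purely as an illustration of this kind of transformation (the exact mapping depends on how your tool emits events), the following sketch rewrites the name and namespace of each input and output dataset toward the OpenLineage naming convention for Amazon Redshift, using a hypothetical cluster endpoint and database name.

def normalize_redshift_dataset(dataset: dict) -> dict:
    # Hypothetical target values; derive them from your own environment.
    namespace = "redshift://my-cluster.us-east-1:5439"
    name = dataset.get("name", "")
    # Some producers emit "schema.table" without the database prefix.
    if name.count(".") == 1:
        name = f"dev.{name}"
    return {**dataset, "namespace": namespace, "name": name}

def normalize_event(event: dict) -> dict:
    # Apply the dataset normalization to every input and output of the event.
    for key in ("inputs", "outputs"):
        event[key] = [normalize_redshift_dataset(d) for d in event.get(key, [])]
    return event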

For cases where you want to post OpenLineage events directly into SageMaker or Amazon DataZone without using the proxy pattern explained in this post, a custom transport is available as of OpenLineage version 1.33.0. Use this feature when you don't need additional controls on your OpenLineage event stream, for example, if you don't need any custom transformation logic.

Summary

In this post, we showed how to use the OpenLineage-compatible APIs of SageMaker to capture data lineage from any tool supporting this standard, by following an architectural pattern introduced as the OpenLineage proxy. We presented some examples of how you can set up tools like dbt, Airflow, and Spark to stream lineage events to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone domain. Finally, we introduced a working implementation of this pattern that you can test and discussed some considerations when implementing this same pattern to production.

The SageMaker compatibility with OpenLineage can help simplify governance of your data assets and increase trust in your data. This capability is one of the many features that are now available to build a comprehensive governance strategy powered by data lineage, data quality, business metadata, data discovery, access automation, and more. By bundling data governance capabilities with the growing set of tools available for data and AI development, you can derive value from your data faster and get closer to consolidating a data-driven culture. Try out this solution and get started with SageMaker to join the growing set of customers that are modernizing their data platform.


About the authors

Jose Romero is a Senior Solutions Architect for Startups at AWS, based in Austin, Texas. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on building products and their capabilities in data analytics and governance. She is passionate about building innovative products to address and simplify customers’ challenges in their end-to-end data journey. Outside of work, she enjoys being outdoors to hike and capture nature’s beauty. Connect with her on LinkedIn.