AWS Big Data Blog
Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker
Automation of data processing and data integration tasks and queries is essential for data engineers and analysts to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the ideal tools for your use case. SageMaker Unified Studio offers multiple ways to integrate with data through the Visual ETL, Query Editor, and JupyterLab builders. SageMaker Unified Studio is natively integrated with Apache Airflow and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which you can use to automate workflow orchestration for jobs, querybooks, and notebooks with a Python-based DAG definition.
Today, we are excited to launch a new visual workflows builder in SageMaker Unified Studio. With the new visual workflow experience, you don’t need to code the Python DAGs manually. Instead, you can visually define the orchestration workflow in SageMaker Unified Studio, and the visual definition is automatically converted to a Python DAG definition that is supported in Airflow. This post demonstrates the new visual workflow experience in SageMaker Unified Studio.
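For readers who haven't written Airflow DAGs before, the following is a minimal sketch of the kind of hand-written Python DAG definition that the visual builder now saves you from authoring. It uses only generic Airflow operators with hypothetical task names and placeholder callables; the DAG that SageMaker Unified Studio actually generates uses its own operators and configuration.

```python
# Illustrative only: a hand-written Airflow DAG with three chained tasks,
# mirroring the job -> querybook -> notebook orchestration described in this post.
# Task names and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_data_processing_job():
    print("run the data processing job")  # placeholder


def run_querybook():
    print("run the querybook")  # placeholder


def run_notebook():
    print("run the notebook")  # placeholder


with DAG(
    dag_id="synthetic_review_workflow_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually
    catchup=False,
) as dag:
    processing = PythonOperator(task_id="data_processing_job", python_callable=run_data_processing_job)
    querybook = PythonOperator(task_id="querybook", python_callable=run_querybook)
    notebook = PythonOperator(task_id="notebook", python_callable=run_notebook)

    processing >> querybook >> notebook
```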
Example use case
In this post, a fictional ecommerce company sells many different products, like books, toys, and jewelry. Customers can leave reviews and star ratings for each product so other customers can make informed decisions about what they should buy. We use a sample synthetic review dataset for demonstration purposes, which includes different products and customer reviews.

In this example, we demonstrate the new visual workflow experience with a data processing job, SQL querybook, and notebook. We also identify the top 10 customers who have contributed the most helpful votes per category.

The following diagram illustrates the solution architecture.
In the following sections, we show how to configure a series of components using data processing jobs, querybooks, and notebooks with SageMaker Unified Studio visual workflows. You can use the sample data to extract information for a specific category, update partition metadata, and display query results in a notebook using Python code.
Prerequisites
To get started, you must have the following prerequisites:
- An AWS account
- A SageMaker Unified Studio domain. To use the sample data provided in this post, your domain must be in the us-east-1 Region.
- A SageMaker Unified Studio project created with the Data analytics and AI-ML model development project profile
- A workflow environment
Create a data processing job
The first step is to create a data processing job to run visual transformations to identify top contributing customers per category. Complete the following steps to create a data processing job:
- On the top menu, under Build, choose Visual ETL flow.
- Choose the plus sign, and under Data sources, choose Amazon S3.
- Choose the Amazon S3 source node and enter the following values:
  - S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/
  - Format: Parquet
- Choose Update node.
- Choose the plus sign, and under Transform, choose Filter.
- Choose the Filter node and enter the following values:
  - Filter Type: Global AND
  - Key: product_category
  - Operation: ==
  - Value: Books
- Choose Update node.
- Choose the plus sign, and under Data targets, choose Amazon S3.
- Choose the S3 node and enter the following values:
  - S3 URI: Use the Amazon S3 location from the project overview page and add the suffix /data/books_synthetic_reviews/ (for example, /dzd_al0ii4pi2sqv68/awi0lzjswu0yhc/dev/data/books_synthetic_reviews/)
  - Format: Parquet
  - Compression: Snappy
  - Partition keys: marketplace
  - Mode: Overwrite
  - Update Catalog: True
  - Database: Choose your database
  - Table: books_synthetic_review
  - Include header: False
- Choose Update node.
At this point, you should have an end-to-end visual flow. Now you can publish it.
- Choose Save to project to save the draft flow.
- Change Job name to filter-books-synthetic-review, then choose Update.
The data processing job has been successfully created.
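If you prefer to reason about the flow in code, the following PySpark snippet is a rough conceptual equivalent of what the visual flow does: read the Parquet dataset from Amazon S3, keep only the Books category, and write the result back to your project's S3 location partitioned by marketplace with Snappy compression. This is not the code that SageMaker Unified Studio generates or runs; the output path is a placeholder, and the Data Catalog update performed by the Update Catalog option is omitted.

```python
# Conceptual PySpark equivalent of the visual flow (illustration only).
# The output path is a placeholder; the Data Catalog update is handled
# by the visual flow's Update Catalog option, not shown here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-books-synthetic-review").getOrCreate()

reviews = spark.read.parquet("s3://aws-bigdata-blog/generated_synthetic_reviews/data/")

books = reviews.filter(reviews.product_category == "Books")

(
    books.write
    .mode("overwrite")
    .partitionBy("marketplace")
    .option("compression", "snappy")
    .parquet("s3://<your-project-bucket>/<project-prefix>/data/books_synthetic_reviews/")
)
```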
Create a querybook
Complete the following steps to create a querybook that runs a SQL query to recognize the partitions of the table created by the data processing job:
- Choose the plus sign next to the querybook tab to open a new querybook.
- Enter the following query and choose Save to project. The MSCK REPAIR TABLE statement recognizes new partitions in the table. We don't run this querybook yet, because it is designed to be triggered by the workflow.
MSCK REPAIR TABLE `books_synthetic_review`;
- For Querybook title, enter QueryBook-synthetic-review-<timestamp>, then choose Save changes.
The querybook to recognize new partitions has been successfully created.
Create a notebook
Next, we create a notebook to generate the output and visualize the results. Complete the following steps:
- On the top menu, under Build, choose JupyterLab.
- Choose File, New, and Notebook to create a new notebook.
- Enter code snippets into notebook cells to query the top contributing customers and display the results, then save the cells (provide your AWS account ID, AWS Region, and S3 bucket). A sketch of what these cells might contain follows this list.
- Choose File, Save Notebook.
- Rename the notebook to synthetics-review-result.ipynb, and choose Rename and Save.
- Choose the Git sidebar and choose the plus sign next to the file name.
- Enter the commit message and choose COMMIT.
- Choose Push to Remote.
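The exact notebook cells depend on your environment. The following is a minimal sketch of what they might contain, assuming the AWS SDK for pandas (awswrangler) is available in the notebook kernel, that the table can be queried through Amazon Athena, and that the column names (customer_id, helpful_votes) follow the review dataset's schema. The database name is a placeholder you need to replace with your own value.

```python
# Minimal sketch of the notebook cells (placeholders must be replaced).
# Queries the table produced by the data processing job via Amazon Athena
# and shows the top 10 customers with the most helpful votes.
import awswrangler as wr

DATABASE = "<your_project_database>"  # placeholder: your project's database name

query = """
SELECT customer_id,
       SUM(helpful_votes) AS total_helpful_votes
FROM books_synthetic_review
GROUP BY customer_id
ORDER BY total_helpful_votes DESC
LIMIT 10
"""

top_customers = wr.athena.read_sql_query(sql=query, database=DATABASE)
top_customers
```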
Create a workflow
Complete the following steps to create a workflow:
- On the top menu, under Build, choose Workflows.
- Choose Create new workflow.
- Choose the plus sign, then choose Data processing job.
- Choose the Data processing job node, then choose Browse jobs.
- Select filter-books-synthetic-review and choose Select.
- Choose the plus sign, then choose Querybook.
- Choose the Querybook node, then choose Browse files.
- Select QueryBook-synthetic-review-<timestamp>.sqlnb and choose Select.
- Choose the plus sign, then choose Notebook.
- Choose the Notebook node, then choose Browse files.
- Select synthetics-review-result.ipynb and choose Select.
At this point, you should have an end-to-end visual workflow. Now you can publish it.
- Choose Save to project to save the draft workflow.
- Change Workflow name to synthetic-review-workflow and choose Save to project.
Run the workflow
To run your workflow, complete the following steps:
- Choose Run on the workflow details page.
- Choose View runs to see the running workflow.
When the run is complete, you can check the notebook task result by choosing the run ID (manual__<timestamp>), then choosing the notebook task ID (notebook-task-xxxx).
You can find the IDs of the top 10 customers who have contributed the most helpful votes in the notebook output.
Clean up
To avoid incurring future charges, clean up the resources you created during this walkthrough:
- On the workflows page, select your workflow, and under Actions, choose Delete workflow.
- On the Visual ETL flows page, select filter-books-synthetic-review, and under Actions, choose Delete flow.
- In the Query Editor, enter and run the following SQL to drop the table:
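The drop statement removes the table created by the data processing job. A minimal version, assuming you kept the default table name from this walkthrough, looks like the following:

```sql
DROP TABLE `books_synthetic_review`;
```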