AWS for Industries
Augmenting Datasets using Generative AI and Amazon SageMaker for Autonomous Driving Use Cases on AWS
Autonomous vehicles (AVs) must be driven hundreds of millions of miles, and sometimes hundreds of billions of miles, to capture enough data to confidently demonstrate a vehicle’s safety on the road. This approach to collecting data, however, is hindered by two main limitations: (1) high fleet acquisition and maintenance costs and (2) the near impossibility of capturing exhaustive driving conditions. Such limitations suggest a need to supplement or modify existing real-world data using alternative approaches such as virtual testing, simulations and generation of synthetic data, among others.
The latest developments in the field of Machine Learning (ML) provide an alternative approach to using and maintaining large vehicle fleets for data collection. Generative artificial intelligence (GenAI) models are now capable of utilizing previously recorded sessions to produce new synthetic images and video data, representing fringe driving conditions for which real data was previously unavailable. While still an open research topic, this approach can potentially be used to produce more robust models, trained on an extensive variety of driving conditions at a fraction of the cost.
This blog post describes how GenAI can be used to help overcome some of the classical challenges in Computer Vision. Concretely, it will introduce a combination of cutting-edge models in computer vision and GenAI and explore different model configurations to understand the potential of applying this technology in the automotive industry.
Dataset Augmentation
Dataset augmentation is a crucial technique for training robust ML models for automotive applications such as self-driving cars. Traditionally, training data has been augmented by applying geometric transformations (rotations, translations, shearing) or by altering image properties like illumination and contrast. However, recent advances in generative models allow more localized, semantically meaningful augmentations. Rather than relying solely on classical image processing for global modifications, modern techniques can realistically add, remove, or alter individual objects and scene attributes. For example, generative adversarial networks (GANs) and variational autoencoders (VAEs) have been used to change weather conditions, add or delete vehicles from street scenes, or modify facial attributes like glasses and hairstyles. These semantic augmentations better expose the model to nuances in the problem space during training, increasing generalization performance on complex real-world data. Overall, using state-of-the-art generative models moves data augmentation from simple image transformations to meaningful semantic manipulations that help produce more robust models.
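For contrast with the generative approach, the classical augmentations mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the accompanying repository: the function names are hypothetical, and images are assumed to be H x W x C arrays of type uint8.

```python
import numpy as np


def horizontal_flip(image: np.ndarray) -> np.ndarray:
    """Mirror an H x W x C image left to right (a geometric transformation)."""
    return image[:, ::-1, :]


def translate(image: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift the image by (dx, dy) pixels and fill uncovered pixels with zeros."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    if abs(dx) >= w or abs(dy) >= h:
        return out  # the whole image was shifted out of view
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_x0, dst_x1 = max(0, dx), min(w, w + dx)
    dst_y0, dst_y1 = max(0, dy), min(h, h + dy)
    out[dst_y0:dst_y1, dst_x0:dst_x1] = image[src_y0:src_y1, src_x0:src_x1]
    return out


def adjust_brightness_contrast(
    image: np.ndarray, brightness: float = 0.0, contrast: float = 1.0
) -> np.ndarray:
    """Scale contrast around mid-gray, add a brightness offset, clip to uint8."""
    pixels = image.astype(np.float32)
    pixels = (pixels - 127.5) * contrast + 127.5 + brightness
    return np.clip(np.rint(pixels), 0, 255).astype(np.uint8)
```

Each of these operations acts on the whole image at once, which is the limitation that the prompt-driven, per-object edits described in this post aim to overcome.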
This blog post will demonstrate a proposed concept workflow using cutting-edge Generative AI (GenAI) techniques to alter the conditions in videos captured by cameras mounted in data collection vehicles. With a single text prompt, sunny scenes can be seamlessly transformed to snowy or rainy environments. This is achieved by combining zero-shot learning models like Grounding DINO, Segment Anything and Stable Diffusion to intelligently add and remove objects in images. For completeness of this concept, this capability was also extended to video, utilizing optical flow to enable smooth transitions between frames. To enable practical deployment, these GenAI models were implemented as Amazon SageMaker notebooks so the blog reader can run the code and test this concept.
The proposed approach in this blog highlights the potential of GenAI to manipulate visual media with a concrete application in autonomous driving. It is not the intention of the authors to propose a finalized solution, but to share a proposed concept and technique that customers may consider to help inspire future applications while promoting responsible development.
Solution Overview
Prerequisites
This blog post is accompanied by an AWS CloudFormation template that will create the SageMaker notebooks used to deploy the models and run inference using a SageMaker notebook instance.
The reader will need an AWS account with AWS Identity and Access Management (IAM) roles that provide access to the following:
- AWS CloudFormation
- Amazon SageMaker:
   - Although Amazon SageMaker notebooks provide instances to run ML models, a GPU-enabled Amazon SageMaker notebook instance is recommended for heavy workloads such as generative AI models. Refer to Amazon SageMaker Pricing for more information about pricing.
   - The outputs shown in this blog post were created using an NVIDIA A10G-enabled instance, ml.g5.2xlarge, to host the models.
- Amazon Simple Storage Service (Amazon S3)
 
For more details, check out the GitHub repository and the AWS CloudFormation template.
Architecture
Figure 1: Combining Grounding DINO, Segment Anything and Stable Diffusion
Figure 1 shows the chained model workflow used to alter a source image using a search prompt and a replace prompt. Following the steps in the figure, in [1] the image is passed together with a search prompt to identify objects or an area in the image. In [2], Grounding DINO is used to add bounding boxes around the detected objects. Segment Anything Model (SAM) is used in [3] to draw a precise mask around the objects. As a final step, in [4] the original image is passed to Stable Diffusion, together with the mask and a replace prompt. In this combination, the mask indicates a region in the image that must be replaced by the object or theme indicated in the replace prompt. The end result of this workflow is an altered image, which differs from the source only in the contents created within the mask boundaries.
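The property stated at the end of this workflow (the output differs from the source only inside the mask) can be made explicit with a small compositing step. The sketch below is an illustration under assumptions, not the accompanying code: `composite_with_mask` is a hypothetical helper, and it assumes the generated image has the same size as the source.

```python
import numpy as np


def composite_with_mask(
    source: np.ndarray, generated: np.ndarray, mask: np.ndarray
) -> np.ndarray:
    """Take `generated` pixels where the mask is set and `source` pixels elsewhere.

    source, generated: H x W x C arrays of the same shape.
    mask: H x W array, nonzero (or True) where content should be replaced.
    """
    if source.shape != generated.shape:
        raise ValueError("source and generated images must have the same shape")
    if mask.shape != source.shape[:2]:
        raise ValueError("mask must match the image height and width")
    replace_here = mask.astype(bool)[..., None]  # broadcast over channels
    return np.where(replace_here, generated, source)
```

Pasting the generated region back onto the source in this way guarantees that pixels outside the mask stay untouched, whatever the generative model returns for them.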
Figure 2: Optical Flow and Image Modification
Figure 2 conceptually describes how the results of the first stage are used in conjunction with optical flow. The basic idea of optical flow is to find pixel correspondences between consecutive images. In this solution, optical flow is used to transform the generated images (rightmost side of Figure 1) according to the pixel displacement between two consecutive images (leftmost side of Figure 1). This procedure is then applied to all frames, creating smoother sequential image representations (i.e., video-like content).
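To make the idea of transforming an image according to pixel displacement concrete, the following sketch warps an image with a dense flow field. It is a simplified illustration and not the accompanying implementation: the function name, the backward-warping convention (each output pixel looks up its source location), and nearest-neighbour sampling are all assumptions made for brevity.

```python
import numpy as np


def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `image` using a dense flow field.

    flow has shape H x W x 2, with flow[y, x] = (dx, dy). The output pixel at
    (x, y) is sampled from image[y + dy, x + dx], rounded to the nearest pixel
    and clamped to the image border.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]
```

In a pipeline like the one described here, the flow would be estimated between two consecutive source frames and then applied to the generated frame, so that the synthetic content follows the motion observed in the original video.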
Concrete Implementation
This solution relies on a pipeline of several generative models and techniques based on the work proposed in Grounded-SAM, specifically exploring the combination of Grounding DINO, Segment Anything Model (SAM), Stable Diffusion and Optical Flow for temporal coherence between frames.
Following the flow shown in Figure 3:
1. The first step is to extract frames from the video for separate analysis. Modern AI vision models perform best on static images rather than directly on video. The object detection task is run on each frame to identify regions of interest based on a text prompt. For example, prompting for “cars in the left lane” focuses the model on relevant objects, returning bounding boxes around the objects of interest.
2. These boxes are used to focus the semantic segmentation model, which replaces the broader bounding boxes with a precise mask wrapped around the contour of each object (see Figure 5 for an example of these masks).
3. The inpainting task is handled by Stable Diffusion Inpainting, which replaces the content within the mask with new content, guided by a second prompt describing the content to be generated. This prompt can be left empty to remove the original objects from the image.
4. Next, to achieve visual coherence between generated images, the altered regions are homogenized across frames using optical flow. By matching pixel movements, optical flow connects corresponding areas in sequential frames. If a car moves from left to right, optical flow propagates the selection mask smoothly. This stabilization keeps selections locked onto moving objects. Finally, a generative neural network seamlessly blends the generated imagery into the original footage. The resulting video shows nearly imperceptible transitions between real and synthetic elements.
Overall, this multi-step process uses modern AI techniques for object detection, segmentation, tracking and generation to manipulate video in a photorealistic manner.
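The four steps above can be summarized as a small orchestration function. This is a structural sketch only: `augment_frames` and its parameters are hypothetical names, and the four stages are passed in as callables so that no particular model API is assumed.

```python
from typing import Any, Callable, Iterable, List

Frame = Any  # e.g. an image array; left abstract in this sketch


def augment_frames(
    frames: Iterable[Frame],
    select_prompt: str,
    replace_prompt: str,
    detect: Callable[[Frame, str], Any],
    segment: Callable[[Frame, Any], Any],
    inpaint: Callable[[Frame, Any, str], Frame],
    smooth: Callable[[List[Frame], List[Frame]], List[Frame]],
) -> List[Frame]:
    """Run detection, segmentation and inpainting per frame, then smooth."""
    source_frames = list(frames)  # materialize so the frames can be reused
    generated: List[Frame] = []
    for frame in source_frames:
        boxes = detect(frame, select_prompt)    # step 1: boxes from a text prompt
        mask = segment(frame, boxes)            # step 2: precise mask from boxes
        generated.append(inpaint(frame, mask, replace_prompt))  # step 3
    # Step 4: use motion between source frames to align the generated ones.
    return smooth(source_frames, generated)
```

Keeping the stages as injected callables makes it straightforward to swap a model, for example a different optical flow estimator, without touching the loop.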
Figure 3: Model Invocation Pipeline
In the following section, a concrete use case will be explained, using this pipeline to modify existing frames into a completely new scene. This use case was presented during a Builder’s Session at AWS re:Invent 2023.
Zero-shot foundational models for Object Detection
Zero-shot learning is a machine learning technique that allows a model to perform a task (such as detecting objects of a given class) without having been explicitly trained on that class. Modern zero-shot models for images allow you to find or segment objects using just a single prompt. For example, models such as Grounding DINO combine a transformer-based detector with grounded pre-training so that they learn the association between language and vision modalities.
1. Find objects using Grounding DINO
In this first example, the object detection task will focus on street objects such as stop signs, traffic lights and houses with a specific position in the image. These are defined in a prompt, i.e. “stop sign, traffic lights, houses on the left”. In the code accompanying this blog post, the following functions were built to abstract this functionality.
1. GenAiModels is a holder for an inference pipeline based on Grounded-Segment-Anything. This pipeline chains the following elements in succession:
   a. Grounding DINO: an object detection model. It uses a prompt in natural language as input to detect objects for which it has not been explicitly trained (zero-shot learning).
   b. SamProcessor: an image processor used for preparing images to be used with the SAM model.
   c. SAM Model: a model capable of segmenting objects in an image using bounding boxes or points as inputs.
   d. Stable Diffusion: a model used to generate images based on text prompts. For this use case, a version capable of inpainting is used, meaning that the model can replace a specific part of an image while keeping it coherent with its context.
2. The Prompts object holds the prompts to select and replace. Figure 4 shows the original image before any object detection, segmentation or inpainting process.
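As an illustration of what a select and replace prompt holder could look like, the sketch below defines a small data class and a helper that normalizes the select prompt. Both `PromptPair` and `to_detection_query` are hypothetical and are not the Prompts object from the accompanying code. The normalization (lowercase phrases, each terminated by a period) follows a convention commonly used with Grounding DINO text queries; whether the accompanying code applies it is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Hypothetical holder for the two prompts used by the pipeline."""

    select: str        # what to find, e.g. "stop sign, traffic lights"
    replace: str = ""  # what to generate; empty means remove the objects


def to_detection_query(select_prompt: str) -> str:
    """Turn a comma-separated prompt into lowercase, period-terminated phrases."""
    phrases = [part.strip().lower() for part in select_prompt.split(",")]
    phrases = [phrase for phrase in phrases if phrase]  # drop empty entries
    return ". ".join(phrases) + "." if phrases else ""


prompts = PromptPair(select="stop sign, traffic lights, houses on the left")
query = to_detection_query(prompts.select)
# query == "stop sign. traffic lights. houses on the left."
```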
2. Perform segmentation using Segment Anything
Similarly, Grounding DINO can be used in tandem with a powerful segmentation model such as Segment Anything to perform detailed segmentation without task-specific training. Similar to zero-shot object detection methods, it finds the relationship between language and fine-grained visual segmentation. These methods are usually trained using over 10 million images.
Figure 5 illustrates an example of integrating zero-shot object detection with automated segmentation. This is achieved by utilizing a straightforward prompt, such as “stop sign, traffic lights, house on the right”. This approach combines the capabilities of zero-shot object detection, which can identify objects without prior training, with the precision of automatic segmentation, which outlines the boundaries of the detected objects.
Figure 5: Image with Semantic Segmentation Applied
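When a prompt matches several objects, the segmentation step yields one mask per object, while the inpainting step needs a single mask. A minimal sketch of that merge is shown below. The function names are hypothetical, and the output convention (255 marks pixels to be repainted, 0 marks pixels to keep) is an assumption that matches how inpainting masks are commonly encoded.

```python
import numpy as np


def merge_masks(masks: np.ndarray) -> np.ndarray:
    """Collapse N x H x W per-object masks into one H x W uint8 mask.

    A pixel is set to 255 if any object mask covers it, and to 0 otherwise.
    """
    if masks.ndim != 3:
        raise ValueError("expected masks with shape N x H x W")
    merged = np.any(masks.astype(bool), axis=0)
    return merged.astype(np.uint8) * 255


def mask_coverage(mask: np.ndarray) -> float:
    """Return the fraction of image pixels selected by the mask."""
    return float(np.count_nonzero(mask)) / mask.size
```

Checking the coverage before inpainting is a cheap sanity test: a value close to 0 suggests the prompt matched nothing, and a value close to 1 suggests the whole frame would be regenerated.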
3. Replace an object using Stable Diffusion Inpainting
Traditional inpainting algorithms replace information in an image using the context of the rest of the image. Today, a common use of this technology is for smartphone users to remove unwanted objects from pictures. Generative AI methods push this boundary further by replacing a selected area with an object described in a prompt. For example, you can replace a plain street with a flooded street. Figure 6 shows how the aforementioned zero-shot methods can be combined with image inpainting.
Figure 6: Image with Inpainting Applied
4. Prompted Segmentation in a group of images
The concept of segmentation can be scaled to a video by using the same segmentation prompt across a set of frames, such as those contained in file formats used for autonomous driving e.g., ROS bag. Figure 7 shows an example of a car driving on a road and automatically finding the road by using a prompt such as “vehicle lane.”
Figure 7: Segmentation Applied to actual drive data
5. Replace area with a new texture
In the example below (Figure 8) the “vehicle lane” selection is now replaced across all frames with a “flooded street”. Merging all consecutive, generated frames allows for a reconstruction of the video, now demonstrating the altered scene. Note that it is common practice to add a “negative prompt” to avoid the generation of images with unwanted artifacts.
Figure 8: Segmented area replaced with “flooded street texture”
6. Introducing temporal information in the frame sequence generation
Finally, optical flow can also be used to produce a smoother transition across consecutive frames. In the example below, the vehicle lane is replaced with a “snowy street”.
Algorithms for optical flow need to learn how to:
1. Find correspondence between points.
2. Compute the relative offsets between points.
3. Predict flow across large regions of space, even to parts of the image that lack texture for correspondence.
4. Generalize to real data, which means working for objects and textures that were not seen in the training data.
The accompanying code implements two versions of optical flow: (1) using the classic Lucas-Kanade algorithm and (2) using Perceiver-IO.
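To give an intuition for the Lucas-Kanade approach, the sketch below estimates a single translation for a whole grayscale image by solving the brightness constancy equation, Ix * u + Iy * v = -It, in the least-squares sense. It is a deliberately simplified illustration and not the implementation in the accompanying code: the classic algorithm solves this system per local window, usually over an image pyramid, and the function name is hypothetical.

```python
import numpy as np


def lucas_kanade_translation(prev: np.ndarray, curr: np.ndarray) -> tuple:
    """Estimate one (u, v) displacement, in pixels, from `prev` to `curr`.

    Both inputs are 2-D grayscale arrays of the same shape. The estimate is
    only reliable for small displacements (well below one pixel here), since
    it relies on a first-order approximation of the image.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    # Spatial gradients of the average image; np.gradient returns (d/dy, d/dx).
    grad_y, grad_x = np.gradient(0.5 * (prev + curr))
    grad_t = curr - prev  # temporal gradient
    # One equation per pixel: grad_x * u + grad_y * v = -grad_t.
    a = np.stack([grad_x.ravel(), grad_y.ravel()], axis=1)
    b = -grad_t.ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(u), float(v)
```

Regions without texture contribute near-zero gradients and therefore carry almost no information about the motion, which is the difficulty behind item 3 in the list above.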
Figure 9 shows the result of changing the street environment by using a prompt such as “snowy street, 4k”.
Figure 9: Segmented area replaced with “snowy street” with temporal smoothing
Cleanup
Deleting the CloudFormation stack removes all the resources that were originally created. If the deletion fails and returns a DELETE_FAILED state, you can choose to force delete the stack.
Conclusion
This blog post illustrated a combination of models and techniques for performing automated, customizable augmentations on visual content using Generative AI and computer vision, with a focus on autonomous driving. By simplifying the interaction with models using natural language prompts, users can create alterations such as changing the weather or removing objects from the scene. In addition, this post also showed how these single-frame augmentations can be extended to video-like content using computer vision techniques like optical flow.
As evidenced in the results, the technology still requires maturing to produce visuals that can rival those of real driving scenes. However, it is possible for developers to already find value in implementing this type of approach: slight alterations to existing scenes, for example, can expose models to a broader range of visual elements that can increase the robustness of their predictions, while maintaining the visual quality of the source content.
As a final comment, it is worth noting that the described approach is meant as an exercise in conceptual development for this industry, and should not be viewed as a production-ready solution.
