How to use life sciences data from AWS Open Data program in Amazon Bedrock

The Amazon Web Services (AWS) Open Data Sponsorship Program eliminates data acquisition barriers by hosting high-value datasets in the cloud, enabling researchers and analysts to focus on discovery and innovation rather than data management. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Athena, AWS Lambda, and Amazon EMR. AWS provides a catalog of publicly available datasets through the Registry of Open Data on AWS. The registry has over 700 datasets open to the public, spanning government data, scientific research, life sciences, climate, satellite imagery, geospatial data, and genomic data.

Many research consortiums, such as the Human Cell Atlas (HCA), participate in the AWS Open Data Sponsorship Program. Visit Human Cell Atlas to view the full dataset. We use HCA data from the Registry of Open Data on AWS to demonstrate how customers can use data made available through AWS Open Data in Amazon Bedrock.

In this post, we discuss how to use datasets in the Registry of Open Data on AWS with Amazon Bedrock Knowledge Bases. With Amazon Bedrock Knowledge Bases, you can give foundation models (FMs) and agents contextual information from private and public data sources to deliver more relevant, accurate, and customized responses. By using the metadata from datasets such as HCA’s Asian Immune Diversity Atlas (AIDA), you can open up information such as age, sex, smoking status, and body mass index (BMI) to users who might not be comfortable with SQL commands or other tools commonly used to search these types of data. Now, nontechnical decision-makers have access to technical data in an accessible and understandable format through a chat-based assistant.

Dataset overview

We use HCA’s AIDA dataset, available at s3://humancellatlas/temp/AIDA/, as our knowledge base. AIDA contains transcriptome variation data from five major Asian population groups (Chinese, Japanese, Korean, Indian, and Malay) and characterizes variation associated with ethnicity, environment, age, sex, smoking status, and BMI. The dataset contains 1,700 FASTQ files, and the metadata is formatted as TSV files available in a public Amazon Simple Storage Service (Amazon S3) bucket in the Registry of Open Data on AWS.

It’s recommended to delete the knowledge base and structured data store when you’re done to stop incurring additional costs.

Prerequisites

To follow the solution walkthrough, you need the following prerequisites:

  1. Access to an AWS account with permissions to the relevant services
  2. Familiarity with the AWS console

Solution walkthrough

Start with the Registry of Open Data on AWS:

  1. In a web browser, go to the Registry of Open Data on AWS
  2. In the Search datasets bar, enter HCA, as shown in the following screenshot.

    Figure 1: Registry of Open Data on AWS with HCA in the search datasets box

  3. Scroll if needed and choose the Human Cell Atlas dataset to open the registry page, shown in the following screenshot.

    Figure 2: Human Cell Atlas dataset on Registry of Open Data on AWS

  4. The HCA dataset doesn’t have an option to browse buckets, but it can be accessed through the AWS Command Line Interface (AWS CLI). For information about how to install and configure the tool, visit What is the AWS Command Line Interface? When the AWS CLI has been installed and configured, enter the following to list the contents of the bucket:

     aws s3 ls --no-sign-request s3://humancellatlas/

  5. To list a folder within the bucket, add the folder name to the path in your command. For example, to list the temp/AIDA folder, enter:

     aws s3 ls --no-sign-request s3://humancellatlas/temp/AIDA/

  6. Enter the following command to download the mmc1.xlsx-Table_S1.tsv file to your current directory:

     aws s3 cp --no-sign-request s3://humancellatlas/temp/AIDA/mmc1.xlsx-Table_S1.tsv .
  7. To view the data, open mmc1.xlsx-Table_S1.tsv in a spreadsheet application or text editor.

File and data format

The format of the file and the data within are important when considering a dataset as a knowledge base. You need to understand what is in the file and how it’s referenced so you can structure your questions to get the most appropriate answers.

For example, data in the following table has the columns DCP_ID, Self-reported ethnicity, Age, Country, Sex, BMI, Smoking Status, and scRNA-seq Experimental Batch. This post focuses on these columns. For documentation on each column, visit Asian diversity in human immune cells in PubMed.

Figure 3: Data in columns in the mmc1.xlsx-Table_S1.tsv table
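If you don’t have a spreadsheet application handy, you can also preview the file from a terminal. The following is a minimal sketch that assumes a Unix-like shell with the standard head and cut utilities; it prints the first rows of the first eight tab-delimited columns:

# Show the first 5 lines, limited to the first 8 tab-delimited columns
head -n 5 mmc1.xlsx-Table_S1.tsv | cut -f 1-8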

Create a private S3 bucket

To create a private S3 bucket to contain the mmc1.xlsx-Table_S1.tsv file, follow these steps:

  1. On the Amazon S3 console, choose Create bucket and name the bucket YOURNAME-aida, replacing YOURNAME with your information. Leave everything else on the page as defaults and at the bottom of the page, choose Create bucket.
  2. When the bucket is created, choose it to view the bucket contents (it will be empty). Drag and drop the mmc1.xlsx-Table_S1.tsv file into the bucket. Alternatively, you can upload the file by choosing Upload.

When the upload is successful, you’ll receive an Upload succeeded notification, as shown in the following screenshot.

Figure 4: Upload of mmc1.xlsx-Table_S1.tsv in the bucket
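If you prefer the AWS CLI over the console, you can create the bucket and upload the file with the following commands. This is a sketch that assumes your CLI credentials have Amazon S3 permissions and that the bucket name YOURNAME-aida is globally unique:

# Create the private bucket (bucket names must be globally unique)
aws s3 mb s3://YOURNAME-aida

# Upload the metadata file to the bucket
aws s3 cp mmc1.xlsx-Table_S1.tsv s3://YOURNAME-aida/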

Create an Amazon Redshift Serverless environment and load data

  1. For a tutorial about loading data into Amazon Redshift, visit Tutorial: Loading data from Amazon S3.
  2. On the Amazon Redshift console, create an Amazon Redshift Serverless environment to use:
    • If you haven’t used Amazon Redshift before, choose Get started with Amazon Redshift Serverless.
      1. Keep the defaults. This creates a dev serverless database.
      2. Wait for it to be completed before continuing to the following step.
    • Choose Create workgroup.
      1. Name it dev-hcls-workgroup.
        • Keep the defaults.
        • Choose Next
      2. Create new namespace called dev-hcls-namespace.
      3. Choose Create IAM role, and choose whether to allow access to Any S3 bucket or Specific S3 buckets.
      4. Choose Next, then Create.
    • Wait for it to be complete before continuing to the next step.
    • Choose dev-hcls-namespace.
    • Choose the Security and encryption tab and copy the IAM role name to the clipboard. It should be similar to AmazonRedshift-CommandsAccessRole-20250606T131327.
    • Choose Query data. This section lets you query and load data (for more information, refer to Get started with Amazon Redshift Serverless data warehouses).
      • To connect to Amazon Redshift, choose Serverless: dev-hcls-workgroup.
        1. Keep the defaults (Federated connection) and choose Create connection.
        2. Expand native databases, then dev, public, and Tables. Note that there are no tables yet.
    • Choose Load data
      • For the S3 bucket (default), enter or browse to your bucket name and choose s3://<YOURNAME>-aida/mmc1.xlsx-Table_S1.tsv.
        1. Set the Region, such as us-east-1.
        2. Keep the default file format as CSV.
        3. Change Ignore header rows to 2.
        4. Change the delimiter to \t, which is tab-delimited.
        5. Choose Data conversion parameters.
          1. Choose Load specified string as null and enter NA in the field for the null string. We set this so that we can load rows where the “Smoking Status” is set to “NA” instead of 0 or 1.
        6. Choose Next
      • Choose Load new Table
        1. Make sure the cluster or workgroup is set to dev-hcls-workgroup
        2. Choose dev for Database.
        3. Choose public for Schema
        4. Name the table DONOR.
        5. For the IAM role, use the AmazonRedshift-CommandsAccessRole-### role that you copied to the clipboard earlier.

The table columns might not have the correct column names and might instead be named col0, col1, col2, and so on. Double-click each column name and change it to match the column names in the file that you downloaded earlier. The following screenshot shows the columns with their labels.

Figure 5: mmc1.xlsx-Table_S1.tsv data in columns

    • Choose Create table.
      • Wait for the table to be created before continuing to the next step.
      • Choose Cancel to exit the dialog box because you will load the data manually. Use the following command, replacing YOURNAME, ACCOUNT-ID, and ARN with your information:
        COPY dev.public.donor
        FROM 's3://<YOURNAME>-aida/mmc1.xlsx-Table_S1.tsv'
        IAM_ROLE 'arn:aws:iam::<ACCOUNT-ID>:role/service-role/AmazonRedshift-CommandsAccessRole-<ARN>'
        FORMAT AS CSV
        DELIMITER '\t'
        QUOTE '"'
        IGNOREHEADER 2
        REGION AS 'us-east-1'
        NULL 'NA';
      • Select the entire command and choose Run
      • Now check the data. To do so, follow these steps:
        1. Enter a couple of new lines below the COPY dev.public.DONOR command that’s in your existing Amazon Redshift script window.
        2. Enter the following command into the window below the COPY dev.public.DONOR command:
          SELECT *
          FROM dev.public.DONOR
          LIMIT 10;
      • Select the entire command and choose RUN.

Figure 6: Rows of data
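As an additional check, you can confirm how many rows were loaded. The following query simply counts the rows in the new table; compare the result against the number of donor rows in mmc1.xlsx-Table_S1.tsv (the line count of the file minus the two header rows):

SELECT COUNT(*) AS row_count
FROM dev.public.donor;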

You now have data in Amazon Redshift. In the next section, we show how to use it as a knowledge base. Leave the query editor open because you need to return to it to add some permissions. Add a few new lines so you have room to run additional commands later. If something went wrong, use the following query to find the error, replacing ERRORID with the query ID of the failed load:

SELECT query_id,
       table_id,
       start_time,
       trim(file_name) AS file_name,
       trim(column_name) AS column_name,
       trim(column_type) AS column_type,
       trim(error_message) AS error_message
FROM sys_load_error_detail
WHERE query_id = ERRORID
ORDER BY start_time
LIMIT 10;

If you need to redo the table creation and import data, you can use the script pane. To remove the table and start over, follow these steps:

  1. Enter the following command, select the entire command, and choose Run:

DROP table donor;

  2. Retry the steps to create the table and load the data, starting with choosing Load data.

Create knowledge base with structured data

In this section, you use the data that you loaded into Amazon Redshift as an Amazon Bedrock knowledge base. Before you can use Amazon Bedrock, you need to allow access to at least one FM. Follow these steps:

  1. On the Amazon Bedrock console, in the left navigation pane, choose Model access, then choose Modify model access.
  2. Select Titan Text G1 – Premier, Titan Embeddings G1 – Text, and Nova Lite.

To create a knowledge base, follow these steps:

  1. On the Amazon Bedrock console under Build, choose Knowledge bases.
  2. Choose Create and choose Knowledge Base with structured data store. Name it aida-structured-data and leave the query engine as Amazon Redshift.
  3. Choose Next.
  4. To configure the query engine:
    • Select the workgroup: dev-hcls-workgroup.
    • Copy the IAM role name to the clipboard because you’ll need to add permissions for it. It will be similar to AmazonBedrockExecutionRoleForKnowledgeBase_vowgg.
    • Select the database: dev.
    • Choose Next.
  5. Review your choices and choose Create knowledge base. Wait for the knowledge base to be created before continuing to the next step. Do not sync yet.
  6. Before you sync, you need to create an Amazon Redshift user for this IAM role. Return to the query editor on the Amazon Redshift console and paste the following commands, replacing the execution role name with the one that you copied to the clipboard in step 4. Run each command by selecting the entire command and choosing Run:

      /* Create an Amazon Redshift user for this IAM role */
      CREATE USER "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg" WITH PASSWORD DISABLE;

      /* Grant access to all tables in the public schema */
      GRANT SELECT ON ALL TABLES IN SCHEMA public TO "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg";

      /* Grant usage on the public schema */
      GRANT USAGE ON SCHEMA public TO "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg";
  7. Return to your knowledge base in Amazon Bedrock and choose Sync. Wait for the sync to be complete before continuing to the next step. This might take some time depending upon how much data you are syncing from Amazon Redshift.
  8. To test it, choose Test knowledge base.
  9. Choose a model. Amazon Nova Lite is a good choice because it’s less expensive than the larger Amazon Nova models.
  10. Enter a question in the Prompt box and choose Enter.
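The Test knowledge base pane is the quickest way to experiment, but you can also query the knowledge base outside the console. The following AWS CLI sketch uses the retrieve-and-generate operation; the knowledge base ID (YOUR_KB_ID) and the Region in the model ARN are placeholders that you replace with your own values:

aws bedrock-agent-runtime retrieve-and-generate \
    --input '{"text": "What is the average age?"}' \
    --retrieve-and-generate-configuration '{
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0"
        }
    }'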

Here are some sample question-and-answer pairs. To learn how Amazon Bedrock derived its answer, choose Details. You can collapse configurations in the navigation pane to have more room.

Question:

What variables are in this dataset?

Answer:

The dataset includes several variables: dcp_id, self-reported ethnicity, age, country, sex, bmi, smoking status, and scrna-seq experimental batch. Each row represents a donor with these attributes. For example, the first donor has a dcp_id of IN_NIB_H019, is Indian, 25 years old, from India, female, a BMI of 22.2, non-smoker, and was part of the IN_NIB_B001 scrna-seq experimental batch. The dataset contains 669 such entries.

For this example, viewing the details shows that the model turned the question into the following SQL statement:

SELECT "dcp_id", "self-reported ethnicity", "age", "country", "sex", "bmi", "smoking status", "scrna-seq experimental batch" FROM public.donor;

Next question:

What is the average age?

Answer:

40 years

For this example, viewing the details shows that the model turned the question into the following SQL statement:

SELECT AVG(CAST(age AS NUMERIC)) AS "Average Age" FROM public.donor;

Next question:

What is the highest bmi?

Answer:

The highest BMI recorded among the donors is 38.20

For this example, viewing the details shows that the model turned the question into the following SQL statement:

SELECT MAX("bmi") AS "Highest BMI" FROM public.donor;

Next question:

How many have “Smoking status” of 1?

Answer:

There are 162 donors in the dataset with a 'Smoking status' of 1.

For this example, viewing the details shows that the model turned the question into the following SQL statement:

SELECT COUNT(*) AS "Number of Donors with Smoking Status 1" FROM public.donor WHERE "smoking status" = '1';

Last question:

What is the percentage of females in this study?

Answer:

Based on the dataset, there are no females recorded, as the percentage of females is 0.00%.

This didn’t work, but why? Choose Details to find the SQL statement that was generated:

SELECT round(100.0 * sum(case when "sex" = 'female' then 1 else 0 end) / count(*), 2) as "Percentage of Females" FROM public.donor;

Copy the SQL statement and go back to Amazon Redshift. Paste the entire SELECT statement into the script window, select the entire command, and choose Run. It returns 0, the same result that Amazon Bedrock reported. To find out why, compare the command with the first few rows of data. Copy and paste this entire command into the script editor and choose Run:

SELECT *
FROM dev.public.donor
LIMIT 10;

Compare the columns in the Amazon Redshift table with the command. In the command, you’re looking for female, yet the rows have either Female or Male. Case matters! To address the issue, you can lowercase the entire column with the following command and update the table to keep the change:

UPDATE donor SET sex = LOWER(sex);

Check the first ten rows again:

SELECT *
FROM dev.public.DONOR
LIMIT 10;

Note that the values are now female or male.

To make the knowledge base pick up the change, go to the Amazon Bedrock console, choose Knowledge Bases, choose the aida-structured-data knowledge base, and to the right of Query Engine choose Sync. Wait for the sync to complete before continuing. To ask the question again, choose Test knowledge base and choose a model as you did previously. Reenter the question, and you should receive the answer that the percentage of females in the study is 53.96%. Choose Details to view the command that was generated:

SELECT round(100.0 * sum(case when "sex" = 'female' then 1 else 0 end) / count(*), 2) as "Percentage of Females" FROM public.donor;
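Updating the stored values works here because the knowledge base generates its own SQL and has no way of knowing about mixed-case data. If you prefer not to modify the table, an equivalent ad hoc query that compares case-insensitively is sketched below; note that this only helps when you write the SQL yourself:

SELECT round(100.0 * sum(case when LOWER("sex") = 'female' then 1 else 0 end) / count(*), 2) as "Percentage of Females" FROM public.donor;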

Cleanup

To avoid incurring future charges, delete the knowledge base and the other resources you created in this walkthrough. To delete the knowledge base, follow these steps (commands for removing the other resources follow the list):

  1. On the Amazon Bedrock console under Build, select the aida-structured-data knowledge base
  2. Choose Delete and then type delete in the window.
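The private S3 bucket and the Amazon Redshift Serverless resources that you created also incur costs. The following AWS CLI sketch removes them, assuming the names used earlier in this post (YOURNAME-aida, dev-hcls-workgroup, and dev-hcls-namespace); adjust the names if you chose different ones:

# Delete the private bucket and its contents
aws s3 rb s3://YOURNAME-aida --force

# Delete the Amazon Redshift Serverless workgroup, then the namespace
aws redshift-serverless delete-workgroup --workgroup-name dev-hcls-workgroup
aws redshift-serverless delete-namespace --namespace-name dev-hcls-namespace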

Conclusion

The Human Cell Atlas is a global consortium that is mapping every cell type in the human body, creating a three-dimensional atlas of human cells to transform our understanding of biology and disease. The goal of the atlas is to build comprehensive reference maps of all human cells—the fundamental units of life—as a basis for understanding basic human biological processes and for diagnosing, monitoring, and treating disease. The atlas is intended to help scientists understand how genetic variants impact disease risk, define drug toxicities, discover better therapies, and advance regenerative medicine. The first phase of this ambitious project brings together a suite of flagship projects in key tissues, systems, and organs, including the lung, heart, liver, and immune system. An open global initiative, the Human Cell Atlas Consortium was founded in 2016 and has grown to more than 3,900 HCA members from over 1,700 institutes in more than 100 countries around the world.

The Registry of Open Data on AWS contains over 700 datasets (with more than 180 life sciences datasets) available to the public that can be used to add additional context to an FM. Amazon Bedrock now supports using public datasets in the Registry of Open Data on AWS, so you don’t have to maintain a copy of the dataset. Check out the registry to see whether there are datasets available for you to use with your next project.

Chris Stoner

Chris is the open environmental and geospatial data lead for the AWS Open Data team. Chris was previously the lead product manager for AWS Ground Station, developing “antennas as a service” for space customers. Chris also worked as a NASA contractor at the Alaska Satellite Facility (ASF) Distributed Active Archive Center (DAAC), developing architectures for Sentinel-1 and NISAR missions in the cloud. Chris has an MBA from the University of Massachusetts – Amherst and a bachelor’s degree in IT from the University of Massachusetts – Lowell. Chris is a published author of technical journal articles and holds several patents.

Beryl Rabindran

Beryl is the life sciences lead on the AWS Open Data team. Beryl is a yeast geneticist and cell biologist by training and led clinical research for a medical technology AI startup in the breast cancer imaging space before joining AWS. She is passionate about working directly with researchers from around the world to grow the community of open life sciences data users. Beryl has a PhD in life sciences from the National University of Singapore and an MBA from Cornell University.