AWS Public Sector Blog
How to use life sciences data from AWS Open Data program in Amazon Bedrock
The Amazon Web Services (AWS) Open Data Sponsorship Program eliminates data acquisition barriers by hosting high-value datasets in the cloud, enabling researchers and analysts to focus on discovery and innovation rather than data management. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Athena, AWS Lambda, and Amazon EMR. AWS provides a catalog of publicly available datasets through the Registry of Open Data on AWS. The registry has over 700 datasets open to the public, including government data, scientific research, life sciences, climate, satellite imagery, geospatial, and genomic data.
Many research consortiums, such as the Human Cell Atlas (HCA), participate in the AWS Open Data Sponsorship program. Visit Human Cell Atlas to view their full dataset. We use HCA data from the Registry of Open Data on AWS to demonstrate how customers can use data made available through AWS Open Data in Amazon Bedrock.
In this post, we discuss how to use datasets in the Registry of Open Data on AWS with Amazon Bedrock Knowledge Bases. With Amazon Bedrock Knowledge Bases, you can give foundation models (FMs) and agents contextual information from private and public data sources to deliver more relevant, accurate, and customized responses. By using the metadata from datasets such as HCA’s Asian Immune Diversity Atlas (AIDA), you can open up information such as age, sex, smoking status, and body mass index (BMI) to users who might not be comfortable with SQL commands or other tools commonly used to search these types of data. Now, nontechnical decision-makers have access to technical data in an accessible and understandable format through a chat-based assistant.
Dataset overview
We use HCA’s AIDA dataset, found at s3://humancellatlas/temp/AIDA/, as our knowledge base. AIDA contains transcriptome variation data from five major Asian population groups (Chinese, Japanese, Korean, Indian, and Malay), and characterizes their variation associated with ethnicity, environment, age, sex, smoking status, and BMI. The dataset contains 1,700 FASTQ files, and the metadata is formatted as TSV files available in a public Amazon Simple Storage Service (Amazon S3) bucket in the Registry of Open Data on AWS.
We recommend deleting the knowledge base and structured data store when you’re done to stop incurring additional costs; see the Cleanup section at the end of this post.
Prerequisites
To implement the solution, you need the following prerequisites:
- Access to an AWS account with permissions to the relevant services
- Familiarity with the AWS console
Solution walkthrough
Start with the Registry of Open Data on AWS:
- In a web browser, go to the Registry of Open Data on AWS
- In the Search datasets bar, enter HCA, as shown in the following screenshot.
- Scroll if needed and choose the Human Cell Atlas dataset to open the registry page, shown in the following screenshot.
- The HCA dataset doesn’t have an option to browse buckets, but it can be accessed through the AWS Command Line Interface (AWS CLI). For information about how to install and configure the tool, visit What is the AWS Command Line Interface?. When the AWS CLI has been installed and configured, enter the following to browse the bucket:
aws s3 ls --no-sign-request s3://humancellatlas/
- To navigate to a folder within the bucket, add the folder name to the path in your command. For example, to navigate to the temp/AIDA folder, enter:
aws s3 ls --no-sign-request s3://humancellatlas/temp/AIDA/
- Enter the following command to download the mmc1.xlsx-Table_S1.tsv file and copy it to your current directory:
aws s3 cp --no-sign-request s3://humancellatlas/temp/AIDA/mmc1.xlsx-Table_S1.tsv .
- To view the data, open mmc1.xlsx-Table_S1.tsv in a spreadsheet application or text editor.
File and data format
The format of the file and the data within are important when considering a dataset as a knowledge base. You need to understand what is in the file and how it’s referenced so you can structure your questions to get the most appropriate answers.
For example, data in the following table has the columns DCP_ID, Self-reported ethnicity, Age, Country, Sex, BMI, Smoking Status, and scRNA-seq Experimental Batch. This post focuses on these columns. For documentation on each column, visit Asian diversity in human immune cells in PubMed.
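If you prefer the command line to a spreadsheet application, you can also inspect the file you downloaded earlier. The following is a minimal sketch, assuming a Unix-like shell:

# Preview the header rows and the first few records, limited to the first eight columns
head -n 5 mmc1.xlsx-Table_S1.tsv | cut -f 1-8

# Count the lines in the file (records plus header rows)
wc -l mmc1.xlsx-Table_S1.tsv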
Create a private S3 bucket
To create a private S3 bucket to contain the mmc1.xlsx-Table_S1.tsv file, follow these steps:
- On the Amazon S3 console, choose Create bucket and name the bucket YOURNAME-aida, replacing YOURNAME with your information. Leave everything else on the page as defaults and at the bottom of the page, choose Create bucket.
- When the bucket is created, choose it to view the bucket contents (it will be empty). Drag and drop the mmc1.xlsx-Table_S1.tsv file into the bucket. Alternatively, you can upload the file by choosing Upload.
When the upload is successful, you’ll receive an Upload succeeded notification, as shown in the following screenshot.
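As an alternative to the console upload, you can copy the file with the AWS CLI. This is a small sketch that assumes your AWS credentials are configured and uses the bucket name chosen above (replace YOURNAME):

aws s3 cp mmc1.xlsx-Table_S1.tsv s3://YOURNAME-aida/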
Create Amazon Redshift serverless environment and load data
- For a tutorial about loading data into Amazon Redshift, visit Tutorial: Loading data from Amazon S3.
- On the Amazon Redshift console, create an Amazon Redshift Serverless environment to use:
  - If you haven’t used Amazon Redshift before, choose Get started with Amazon Redshift Serverless.
  - Keep the defaults. This creates a dev serverless database. Wait for it to be completed before continuing to the following step.
- Choose Create workgroup:
  - Name it dev-hcls-workgroup and keep the defaults.
  - Choose Next.
  - Create a new namespace called dev-hcls-namespace.
  - Choose Create IAM role, and choose to allow Any or Specific S3 buckets.
  - Choose Next, then Create.
  - Wait for it to be complete before continuing to the next step.
- Choose dev-hcls-namespace.
- Choose the Security and Encryption tab and copy the IAM role to the clipboard. It should be similar to:
AmazonRedshift-CommandsAccessRole-20250606T131327
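If you prefer scripting this setup, the AWS CLI offers redshift-serverless commands. The following is a minimal sketch, assuming AWS CLI v2; the IAM role with S3 access still needs to be created and associated separately, for example through the console steps above:

# Create the namespace (a default dev database is created for you)
aws redshift-serverless create-namespace --namespace-name dev-hcls-namespace

# Create the workgroup in that namespace
aws redshift-serverless create-workgroup \
    --workgroup-name dev-hcls-workgroup \
    --namespace-name dev-hcls-namespace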
- Choose Query data. This section lets you query and load data (for more information, refer to Get started with Amazon Redshift Serverless data warehouses).
  - To connect to Amazon Redshift, choose Serverless: dev-hcls-workgroup.
  - Keep the defaults (Federated connection) and choose Create connection.
  - Expand native databases, then dev, public, and tables. Note that there are 0 tables.
- Choose Load data.
  - In the S3 bucket field (the default source), enter or browse to your bucket name and choose s3://<YOURNAME>-aida/mmc1.xlsx-Table_S1.tsv.
  - Set the Region, such as us-east-1.
  - Keep the default file format as CSV.
  - Change Ignore header rows to 2.
  - Change the delimiter to \t, which is tab-delimited.
  - Choose Data conversion parameters.
  - Choose Load specified string as null and enter NA in the field for the null string. We set this so that we can load rows where the “Smoking Status” is set to “NA” instead of 0 or 1.
  - Choose Next.
- Choose Load new table.
  - Make sure the cluster or workgroup is set to dev-hcls-workgroup.
  - Choose dev for Database.
  - Choose public for Schema.
  - Name the table DONOR.
  - Use the AmazonRedshift-CommandsAccessRole-### IAM role that you copied to the clipboard earlier.
The table columns might not have the correct column names and might instead be named col0, col1, col2, and so on. Double-click each column name and change it to match the column names in the file that you downloaded earlier. The following screenshot shows the columns with their labels.
- Choose Create table.
- Wait for the table to be created before continuing to the next step.
- Choose Cancel to exit the dialog box because you will load the data manually. Use the following command, replacing YOURNAME, ACCOUNT-ID, and ARN (the date-time suffix of your AmazonRedshift-CommandsAccessRole role) with your information:
COPY dev.public.donor FROM 's3://<YOURNAME>-aida/mmc1.xlsx-Table_S1.tsv' IAM_ROLE 'arn:aws:iam::<ACCOUNT-ID>:role/service-role/AmazonRedshift-CommandsAccessRole-<ARN>' FORMAT AS CSV DELIMITER '\t' QUOTE '"' IGNOREHEADER 2 REGION AS 'us-east-1' NULL 'NA'
- Select the entire command and choose Run
- Now we want to check the data. To do so, follow these steps:
  - Enter a couple of new lines below the COPY dev.public.donor command that’s in your existing Amazon Redshift script window.
  - Enter the following command into the window below the COPY dev.public.donor command:
SELECT *
FROM dev.public.DONOR
LIMIT 10;
  - Select the entire command and choose Run.
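If you would rather run this check from the AWS CLI than the query editor, the Amazon Redshift Data API is one option. The following is a minimal sketch, assuming AWS CLI v2 and the dev-hcls-workgroup created earlier:

# Submit the query and capture the statement ID
STATEMENT_ID=$(aws redshift-data execute-statement \
    --workgroup-name dev-hcls-workgroup \
    --database dev \
    --sql "SELECT * FROM dev.public.donor LIMIT 10;" \
    --query 'Id' --output text)

# Check that the statement finished, then fetch the rows
aws redshift-data describe-statement --id "$STATEMENT_ID" --query 'Status'
aws redshift-data get-statement-result --id "$STATEMENT_ID"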
You now have data in Amazon Redshift. In the next section, we show how to use it as a knowledge base. Leave the query editor open because you need to return to it to add some permissions. Add a few new lines so you have room to run additional commands later. If something went wrong, use the following code to find the error, replacing ERRORID with the query ID of the failed load:
SELECT query_id,
table_id,
start_time,
trim(file_name) AS file_name,
trim(column_name) AS column_name,
trim(column_type) AS column_type,
trim(error_message) AS error_message
FROM sys_load_error_detail
WHERE query_id = ERRORID
ORDER BY start_time
LIMIT 10;
If you need to redo the table creation and import data, you can use the script pane. To remove the table and start over, follow these steps:
- Enter the following command, select the entire command, and choose Run:
DROP table donor;
- Retry the steps to create the table and load the data, starting with choosing Load data.
Create knowledge base with structured data
In this section, you’ll use the data loaded into Amazon Redshift as a knowledge base in Amazon Bedrock. Before you can use Amazon Bedrock, you need to allow access to at least one FM. Follow these steps:
- On the Amazon Bedrock console, in the left navigation pane, choose Model access, then choose Modify model access.
- Select Titan Text G1 – Premier, Titan Embeddings G1 – Text, and Nova Lite.
To create a knowledge base, follow these steps:
- On the Amazon Bedrock console under Build, choose Knowledge bases.
- Choose Create and choose Knowledge Base with structured data store. Name it aida-structured-data and leave the query engine as Amazon Redshift.
- Choose Next.
- To configure the query engine:
  - Select the workgroup: dev-hcls-workgroup.
  - Copy the IAM role to the clipboard, because you’ll need to add some permissions to it. It will be similar to AmazonBedrockExecutionRoleForKnowledgeBase_vowgg.
  - Select the database: dev.
- Choose Next, review your choices, and choose Create knowledge base. Wait for the knowledge base to be created before continuing to the next step. Do not sync yet.
- Before you sync, you need to create an Amazon Redshift database user for this IAM role. Go back to the Amazon Redshift query editor and paste the following commands, changing the execution role name to the one that you copied to the clipboard in the previous step. Run each command by selecting the entire command and choosing Run:
/* create an Amazon Redshift user for this IAM role */
CREATE USER "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg" WITH PASSWORD DISABLE;
/* access to all tables*/
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg";
/* access to new schema */
GRANT USAGE ON SCHEMA public TO "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_vowgg";
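If you would rather look up the knowledge base’s execution role name from the AWS CLI instead of copying it from the console, the following sketch might help, assuming AWS CLI v2 with the bedrock-agent commands and the knowledge base name used above:

# Find the knowledge base ID by name
KB_ID=$(aws bedrock-agent list-knowledge-bases \
    --query "knowledgeBaseSummaries[?name=='aida-structured-data'].knowledgeBaseId" \
    --output text)

# Print the execution role ARN attached to the knowledge base
aws bedrock-agent get-knowledge-base --knowledge-base-id "$KB_ID" \
    --query 'knowledgeBase.roleArn' --output text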
- Return to your knowledge base in Amazon Bedrock and choose Sync. Wait for the sync to be complete before continuing to the next step. This might take some time depending upon how much data you are syncing from Amazon Redshift.
- To test it, choose Test knowledge base.
- Choose a model. Amazon Nova Lite is a good choice because it’s a bit cheaper than the other Amazon Nova models.
- Enter a question in the Prompt box and choose Enter.
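The console chat is convenient for exploring, but you can also query the knowledge base programmatically. The following is a minimal AWS CLI sketch, assuming AWS CLI v2 with the bedrock-agent-runtime commands, a placeholder knowledge base ID (YOUR_KB_ID), and the Amazon Nova Lite foundation model in us-east-1; depending on your Region and model, you might need an inference profile ARN instead of the foundation model ARN.

# Ask the knowledge base a question from the CLI (replace YOUR_KB_ID with your knowledge base ID)
aws bedrock-agent-runtime retrieve-and-generate \
    --input '{"text": "What is the average age?"}' \
    --retrieve-and-generate-configuration '{
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0"
        }
    }'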
Here are some sample question-and-answer pairs. To learn how Amazon Bedrock derived its answer, choose Details. You can collapse configurations in the navigation pane to have more room.
Question:
What variables are in this dataset?
Answer:
The dataset includes several variables: dcp_id, self-reported ethnicity, age, country, sex, bmi, smoking status, and scrna-seq experimental batch. Each row represents a donor with these attributes. For example, the first donor has a dcp_id of IN_NIB_H019, is Indian, 25 years old, from India, female, a BMI of 22.2, non-smoker, and was part of the IN_NIB_B001 scrna-seq experimental batch. The dataset contains 669 such entries.
For this example, viewing the details shows that the model turned the question into the following SQL statement:
SELECT "dcp_id", "self-reported ethnicity", "age", "country", "sex", "bmi", "smoking status", "scrna-seq experimental batch" FROM public.donor;
Next question:
What is the average age?
Answer:
40 years
For this example, viewing the details shows that the model turned the question into the following SQL statement:
SELECT AVG(CAST(age AS NUMERIC)) AS "Average Age" FROM public.donor;
Next question:
What is the highest bmi?
Answer:
The highest BMI recorded among the donors is 38.20
For this example, viewing the details shows that the model turned the question into the following SQL statement:
SELECT MAX("bmi") AS "Highest BMI" FROM public.donor;
Next question:
How many have “Smoking status” of 1?
Answer:
There are 162 donors in the dataset with a 'Smoking status' of 1.
For this example, viewing the details shows that the model turned the question into the following SQL statement:
SELECT COUNT(*) AS "Number of Donors with Smoking Status 1" FROM public.donor WHERE "smoking status" = '1';
Last question:
What is the percentage of females in this study?
Answer:
Based on the dataset, there are no females recorded, as the percentage of females is 0.00%.
This didn’t work, but why? Choose Details to find the SQL statement that was generated:
SELECT round(100.0 * sum(case when "sex" = 'female' then 1 else 0 end) / count(*), 2) as "Percentage of Females" FROM public.donor;
Copy the SQL statement and go back to Amazon Redshift. Paste the entire SELECT statement into the script window, select the entire command, and choose Run. It returns 0 (which is the same as Amazon Bedrock). To find out why, compare the command with the first few rows of data. Copy and paste this entire command into the script editor and choose Run:
SELECT *
FROM dev.public.donor
LIMIT 10;
Compare the columns in the Amazon Redshift table with the command. In the command, you’re looking for female, yet the rows have either Female or Male. Case matters! To address the issue, you can lowercase the entire column with the following command and update the table to keep the change:
UPDATE donor SET sex = LOWER(sex);
Check the first ten rows again:
SELECT *
FROM dev.public.DONOR
LIMIT 10;
Note that the values are now female or male.
To pick up the change in the knowledge base, go to the Amazon Bedrock console, choose Knowledge Bases, choose the aida-structured-data knowledge base, and to the right of Query Engine choose Sync. Wait for the sync to complete before continuing. To ask the question again, choose Test knowledge base and choose a model as you did previously. Reenter the question, and you should receive the answer that the percentage of females in the study is 53.96%. Choose Details to see the command that was generated:
SELECT round(100.0 * sum(case when "sex" = 'female' then 1 else 0 end) / count(*), 2) as "Percentage of Females" FROM public.donor;
Cleanup
To avoid incurring future charges, you need to delete the knowledge base. Follow these steps:
- On the Amazon Bedrock console under Build, select the aida-structured-data knowledge base.
- Choose Delete and then type delete in the window.
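If you created the Amazon Redshift Serverless resources and the private S3 bucket only for this walkthrough, you can remove those as well. The following AWS CLI commands are a minimal sketch, assuming the resource names used in this post; double-check the names before deleting anything you still need:

# Optionally delete the knowledge base from the CLI instead of the console (replace YOUR_KB_ID)
aws bedrock-agent delete-knowledge-base --knowledge-base-id YOUR_KB_ID

# Delete the Amazon Redshift Serverless workgroup, then the namespace
aws redshift-serverless delete-workgroup --workgroup-name dev-hcls-workgroup
aws redshift-serverless delete-namespace --namespace-name dev-hcls-namespace

# Empty and delete the private S3 bucket (replace YOURNAME)
aws s3 rb s3://YOURNAME-aida --force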
Conclusion
The Human Cell Atlas is a global consortium that is mapping every cell type in the human body, creating a three-dimensional atlas of human cells to transform our understanding of biology and disease. Its goal is to build comprehensive reference maps of all human cells, the fundamental units of life, as a basis for understanding human biological processes and for diagnosing, monitoring, and treating disease. The atlas is expected to help scientists understand how genetic variants impact disease risk, define drug toxicities, discover better therapies, and advance regenerative medicine. The first phase of this ambitious project brings together a suite of flagship projects in key tissues, systems, and organs, including the lung, heart, liver, and immune system. An open global initiative, the Human Cell Atlas Consortium was founded in 2016 and has grown to more than 3,900 HCA members from over 1,700 institutes and more than 100 countries around the world.
The Registry of Open Data on AWS contains over 700 datasets (with more than 180 life sciences datasets) available to the public that can be used to add additional context to an FM. Amazon Bedrock now supports using public datasets in the Registry of Open Data on AWS, so you don’t have to maintain a copy of the dataset. Check out the registry to see if there are datasets available for you to use with your next project.
Resources
- Turning data into a knowledge base in the Amazon Bedrock User Guide
- For supported datatypes in Amazon Bedrock, visit Prerequisites for your Amazon Bedrock knowledge base data in the Amazon Bedrock User Guide
- Learn more about Open Data on AWS
- Find data in the Registry of Open Data on AWS