66 new or updated datasets available on the Registry of Open Data on AWS

The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). AWS works with data providers to democratize access to data by making it available to the public for analysis on AWS, develop new cloud-based techniques, formats, and tools that lower the cost of working with data, and encourage the development of communities that benefit from access to shared datasets. Through the AWS Open Data Sponsorship Program, customers are making over 300 PB of high-value, cloud-optimized data available for public use.

All publicly available datasets can be found in the Registry of Open Data on AWS and are now also discoverable on AWS Data Exchange. This quarter, AWS released 66 new or updated datasets.

What are people currently doing with AWS Open Data?

Radboud University Medical Center Nijmegen (Radboudumc) is running AI algorithms for 100,000 users, securely combining private data with public datasets made available through the Registry of Open Data on AWS.
The Open Brain Institute (OBI) recently launched as a new non-profit organization, with the goal of transforming neuroscience from the physical to the virtual world. OBI will be managing EPFL’s Blue Brain Dataset going forward—empowering researchers to build and simulate digital brains with unprecedented detail, scale, and speed.
A recent study published in Nature Methods illustrates the value of the SG-NEx dataset in the development and benchmarking of computational methods for profiling complex transcriptional events at isoform-level resolution.
Open Ceda is now using AWS Open Data for sustainability-related uses as part of the Amazon Sustainability Data Initiative (ASDI).
AWS Open Data is now included in the dataset library of the National University Library as well as the University of the Cumberlands Research Library.
In 2023, NASA launched its Year of Open Science and Open Science Initiative in collaboration with the AWS Open Data program. As part of this initiative, HEASARC (High Energy Astrophysics Science Archive Research Center) data is now available in the cloud, increasing data accessibility for the broader community and enabling science that requires significant cloud computing resources.
Capella Space's constellation of synthetic aperture radar (SAR) satellites delivers 24/7, all-weather Earth observation that penetrates atmospheric conditions, offering near real-time visibility even in cloud-covered areas, day or night. This data is available through the Registry of Open Data on AWS.

Workshops and tutorials on leveraging open datasets

Our blog post on using AWS Open Data in Amazon Bedrock shows how to use open data as a knowledge base in Bedrock applications. The post discusses how you can make technical information, like precipitation and snow depth, available to users who might not be comfortable with SQL commands or other tools commonly used to search these types of data. Nontechnical decision-makers can now reach highly technical data in an accessible and understandable format through a chat-based assistant.
At the recent Human Cell Atlas General Meeting in Singapore, we launched a workshop on Single-cell Omics on the Open Data Program, teaching researchers how to analyze single-cell data from public datasets using AWS HealthOmics. The workshop also demonstrates how to use public datasets to build a knowledge base and analyze datasets through Amazon Bedrock. As a part of this workshop, we released a Jupyter notebook—Accessing AWS Open Data Using Boto3—that demonstrates how to programmatically access and analyze datasets with Python’s boto3 library.
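The notebook itself is not reproduced here. As a rough illustration of what programmatic access with boto3 can look like, the following is a minimal sketch that assumes the dataset lives in a publicly readable Amazon S3 bucket that allows unsigned (anonymous) requests. The helper names `make_anonymous_s3_client` and `list_keys`, and the default Region, are illustrative choices for this sketch and do not come from the workshop.

```python
def make_anonymous_s3_client(region="us-east-1"):
    """Create an S3 client that sends unsigned requests (no AWS credentials).

    boto3 is imported inside the function so that list_keys() below can also
    be used with any client-like object. The default Region is an assumption;
    use the Region shown on the dataset's Registry page.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3",
        region_name=region,
        config=Config(signature_version=UNSIGNED),
    )


def list_keys(client, bucket, prefix="", max_keys=10):
    """Return up to max_keys object keys under prefix in the given bucket."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            if len(keys) >= max_keys:
                return keys
    return keys
```

With boto3 installed, a call such as `list_keys(make_anonymous_s3_client(), "example-open-data-bucket", prefix="some/prefix/")` would list the first few objects. The bucket name and prefix here are hypothetical placeholders; the real values are listed on each dataset's page in the Registry of Open Data on AWS.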
Similarly, the “Working with NOAA satellite data in the AWS Open Data Program” workshop shows how to visualize forest fire detection data using Amazon SageMaker. These intermediate-level tutorials guide researchers and data scientists through accessing, processing, and analyzing open datasets on AWS, demonstrating how to use cloud services effectively for scientific research.
The AWS Open Data team has published three how-to guides to help users work with open datasets, all available in the aws-opendata-samples GitHub repository. These include:
- Migrating Large Datasets to Amazon S3 Using Rclone, which explains how to efficiently perform a server-side copy between Amazon S3 buckets using Rclone.
- Onboard to the Open Data Program and Set Up Update Notifications, which provides manual steps for data providers to join the program and configure notifications.
- Monitoring Amazon S3 Dataset Usage with Server Access Logs and CloudWatch Metrics, which helps onboarded data providers understand storage and request activity.
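To give a sense of the kind of analysis that server access logs make possible, here is a small sketch that counts requests per operation and sums the bytes sent from raw log lines. It is not taken from the guide. It assumes the space-delimited Amazon S3 server access log layout (bucket owner, bucket, bracketed timestamp, remote IP, requester, request ID, operation, key, quoted request URI, HTTP status, error code, bytes sent, and further fields), so check the field order against the current S3 documentation before relying on it. The function name `summarize_access_log` is hypothetical.

```python
import re
from collections import Counter

# Assumed layout of the leading fields of an S3 server access log record.
# Fields after "bytes sent" are ignored by this sketch.
LOG_LINE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?P<request_uri>"[^"]*"|-) (?P<status>\S+) (?P<error_code>\S+) '
    r'(?P<bytes_sent>\S+)'
)


def summarize_access_log(lines):
    """Return (requests per operation, total bytes sent) for the given lines."""
    requests_by_operation = Counter()
    total_bytes_sent = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like access log records
        requests_by_operation[match["operation"]] += 1
        # The log writes "-" when no bytes were sent.
        if match["bytes_sent"].isdigit():
            total_bytes_sent += int(match["bytes_sent"])
    return requests_by_operation, total_bytes_sent
```

Passing the lines of a downloaded log file to `summarize_access_log` returns a `Counter` keyed by operation name (for example `REST.GET.OBJECT`) together with the total number of bytes sent.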

What can you build with these datasets?

Brain Encoding Response Generator (BERG)

The Brain Encoding Response Generator (BERG) dataset from University of California, Berkeley provides comprehensive brain encoding responses, offering researchers valuable data for neuroscience research and artificial intelligence applications in understanding human brain activity patterns.

Brain Encoding Response Generator (BERG) joins 65 other new or updated datasets on the Registry of Open Data in the following categories.

Climate and weather

Geospatial

Life sciences

Metagenomic reference libraries for Slacken from University of Copenhagen
Brain/MINDS Marmoset Connectivity Resource on AWS from RIKEN Center for Brain Science
DHARANI Developing Human-Brain Atlas from Indian Institute of Science
Human Cell Atlas from Human Cell Atlas
MetaGraph Sequence Indexes from University of Helsinki
CZ Grand Challenges – Imaging MIT Licensed data and models from Chan Zuckerberg Initiative
CZ Grand Challenges – Transcriptomic MIT Licensed data and models from Chan Zuckerberg Initiative
CZ Grand Challenges – Imaging BSD licensed data and models from Chan Zuckerberg Initiative
CZ Grand Challenges – Model Benchmarking from Chan Zuckerberg Initiative
Knowledge Portal Network Bottom-line Genetic Associations from Broad Institute
CHIMERA from Radboud University Medical Center
OpenWings OpenData from OpenWings
Steinegger Lab Datasets from Seoul National University
NIH Roadmap Epigenomics from NIH Roadmap Epigenomics Mapping Consortium, Ting Wang Lab at WashU
SPARC from The SPARC Data and Resource Center
Oceanomics from Minderoo Foundation
EEGDash from Swartz Center for Computational Neuroscience
AllTheBacteria from European Bioinformatics Institute
1KG-ONT-VIENNA panel from Institute of Molecular Pathology

Machine learning

How can you make your data available?

Looking to make your data available? The AWS Open Data Sponsorship Program covers the cost of storage for publicly available high-value, cloud-optimized datasets. We work with data providers who seek to:

- Democratize access to data by making it available for analysis on AWS
- Develop new cloud-native techniques, formats, and tools that lower the cost of working with data
- Encourage the development of communities that benefit from access to shared datasets

Learn how to propose your dataset to the AWS Open Data Sponsorship Program.

Learn more about open data on AWS.

AWS Public Sector Blog
