AWS Public Sector Blog
66 new or updated datasets available on the Registry of Open Data on AWS
The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). AWS works with data providers to democratize access to data by making it available to the public for analysis on AWS, develop new cloud-based techniques, formats, and tools that lower the cost of working with data, and encourage the development of communities that benefit from access to shared datasets. Through the AWS Open Data Sponsorship Program, customers are making over 300 PB of high-value, cloud-optimized data available for public use.
All publicly available datasets can be found in the Registry of Open Data on AWS and are now also discoverable on AWS Data Exchange. This quarter, AWS released 66 new or updated datasets.
What are people currently doing with AWS Open Data?
- Radboud University Medical Center Nijmegen (Radboudumc) is running AI algorithms for 100,000 users, securely combining private data with public datasets made available through the Registry of Open Data on AWS.
- The Open Brain Institute (OBI) recently launched as a new non-profit organization, with the goal of transforming neuroscience from the physical to the virtual world. OBI will be managing EPFL’s Blue Brain Dataset going forward—empowering researchers to build and simulate digital brains with unprecedented detail, scale, and speed.
- A recent study published in Nature Methods illustrates the value of the SG-NEx dataset in the development and benchmarking of computational methods for profiling complex transcriptional events at isoform-level resolution.
- Open CEDA now uses AWS Open Data for sustainability-related work as part of the Amazon Sustainability Data Initiative (ASDI).
- AWS Open Data is now listed in the dataset libraries of the National University Library and the University of the Cumberlands Research Library.
- In 2023, NASA launched its Year of Open Science and Open Science Initiative in collaboration with the AWS Open Data program. As part of this initiative, High Energy Astrophysics Science Archive Research Center (HEASARC) data is now available in the cloud, increasing data accessibility for the broader community and enabling science that requires significant cloud computing resources.
- Capella Space's constellation of synthetic aperture radar (SAR) satellites delivers 24/7, all-weather Earth observation. SAR penetrates atmospheric conditions, so the data, available through the Registry of Open Data on AWS, offers near real-time visibility day or night, even in cloud-covered areas.
Workshops and tutorials on leveraging open datasets
- Our blog on using AWS Open Data in Amazon Bedrock shows how to use open data as a knowledge base in Bedrock applications. The post discusses how you can make technical information, like precipitation and snow depth, available to a set of users that might not be comfortable with SQL commands or other tools commonly used to search these types of data. Now nontechnical decision-makers can have access to highly technical data in an accessible and understandable format through a chat-based assistant.
- At the recent Human Cell Atlas General Meeting in Singapore, we launched a workshop on Single-cell Omics on the Open Data Program, teaching researchers how to analyze single-cell data from public datasets using AWS HealthOmics. The workshop also demonstrates how to use public datasets to build a knowledge base and analyze datasets through Amazon Bedrock. As a part of this workshop, we released a Jupyter notebook—Accessing AWS Open Data Using Boto3—that demonstrates how to programmatically access and analyze datasets with Python’s boto3 library.
- Similarly, the “Working with NOAA satellite data in the AWS Open Data Program” workshop shows how to visualize forest fire detection data using Amazon SageMaker. These intermediate-level tutorials guide researchers and data scientists through accessing, processing, and analyzing open datasets on AWS, demonstrating how to use cloud services effectively for scientific research.
- The AWS Open Data team has published three how-to guides to help users work with open datasets, all available in the aws-opendata-samples GitHub repository. These include:
- Migrating Large Datasets to Amazon S3 Using Rclone, which explains how to efficiently perform a server-side copy between Amazon S3 buckets using Rclone.
- Onboard to the Open Data Program and Set Up Update Notifications, which provides manual steps for data providers to join the program and configure notifications.
- Monitoring Amazon S3 Dataset Usage with Server Access Logs and CloudWatch Metrics, which helps onboarded data providers understand storage and request activity.
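To give a feel for the programmatic access these tutorials cover, here is a minimal sketch of listing objects in a public bucket without AWS credentials. It is not taken from the notebook mentioned above: it uses only the Python standard library and the Amazon S3 REST `ListObjectsV2` request rather than boto3, and the function names and the bucket name in the usage note are hypothetical placeholders. It also assumes the dataset's bucket permits anonymous listing, which is not guaranteed for every dataset in the Registry.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Amazon S3 REST responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of an anonymous bucket listing (network call)."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_listing(response.read().decode("utf-8"))
```

Calling `list_public_objects("example-open-data-bucket", prefix="some/prefix/")` would return one page of keys and sizes, where the bucket and prefix stand in for the values shown on a dataset's Registry page. With boto3, the equivalent is typically an S3 client configured for unsigned requests (`botocore.UNSIGNED`), which also handles pagination and regional endpoints.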
What can you build with these datasets?
Brain Encoding Response Generator (BERG)
The Brain Encoding Response Generator (BERG) dataset from University of California, Berkeley provides comprehensive brain encoding responses, offering researchers valuable data for neuroscience research and artificial intelligence applications in understanding human brain activity patterns.
Brain Encoding Response Generator (BERG) joins 65 other new or updated datasets on the Registry of Open Data in the following categories.
Climate and weather
- Met Office Global and Regional Ensemble Prediction System – UK (MOGREPS-UK) on a 30-day rolling archive from Met Office
- Met Office Global Wave model on a 2-year rolling archive from Met Office
- Danish Meteorological Institute (DMI) Open Data Forecasts from Danish Meteorological Institute
- NOAA Global Forecast System (GFS) netCDF Formatted Data from National Oceanic and Atmospheric Administration (NOAA)
- NOAA Unified Forecast System (UFS) Coastal Model from NOAA
- NOAA FourCastNet Global Forecast System (FourCastNetGFS) from NOAA
- NOAA HYSPLIT-compatible meteorological data archives from NOAA
- U.S. Environmental Protection Agency (EPA) Center for Computational Toxicology and Exposure High Throughput Transcriptomics Data from U.S. Environmental Protection Agency
- IWMI DIWASA Blue ET for Africa from International Water Management Institute (IWMI)
- IWMI DIWASA Green ET for Africa from IWMI
- Open CEDA by Watershed from Centre for Environmental Data Analysis
- NOAA Space Weather Follow-On Mission Geostationary Operational Environmental Satellite (GOES) 19 from NOAA
- Satellite – Ocean Colour – NOAA20 – 1 day – Chlorophyll-a concentration (GSM model), (OC3 model), and (OCI model) from Australian Ocean Data Network (AODN)
- Satellite – Ocean Colour – SNPP – 1 day – Chlorophyll-a concentration (GSM model), (OC3 model), and (OCI model) from AODN
- Satellite – Ocean Colour – NOAA20 – 1 day – Diffuse attenuation coefficient (k490), SNPP – 1 day – Diffuse attenuation coefficient (k490), and MODIS – 1 day – Diffuse attenuation coefficient (k490) from AODN
- Satellite – Ocean Colour – MODIS – 1 day – Chlorophyll-a concentration (Carder model), (GSM model), (OC3 model), and (OCI model) from AODN
- Satellite – Ocean Colour – MODIS – 1 day – Nanoplankton fraction (OC3 model and Brewin et al 2012 algorithm) from AODN
- Satellite – Ocean Colour – MODIS – 1 day – Net Primary Productivity (GSM model and Eppley-VGPM algorithm) and (OC3 model and Eppley-VGPM algorithm) from AODN
- Satellite – Ocean Colour – MODIS – 1 day – Optical Water Type (Moore et al 2009 algorithm) from AODN
- Satellite – Ocean Colour – MODIS – 1 day – Picoplankton fraction (OC3 model and Brewin et al 2012 algorithm) from AODN
- Satellite – Sea surface temperature – Level 3 – Single sensor – Himawari-8 – 1 day – Night time from AODN
- Satellite – Sea surface temperature – Level 3 – Single sensor – 1 day – Day and night time – Southern Ocean from AODN
- Satellite – Altimetry calibration and validation from AODN
- Ocean Radar – Capricorn bunker group site – Wave – Delayed mode and Wind – Delayed mode from Australian Ocean Data Network (AODN)
- Ocean Radar – Coffs Harbour site – Wave – Delayed mode and Wind – Delayed mode from AODN
- Ocean Radar – Rottnest shelf site – Wave – Delayed mode and Wind – Delayed mode from AODN
- Ocean Radar – South Australian gulfs site – Sea water velocity – Delayed mode, Wave – Delayed mode, and Wind – Delayed mode from AODN
- Satellite – Sea surface temperature – Level 3 – Multi sensor – 1 day – Day and night time, 3 day, 6 day, and 1 month from AODN
- Ships of Opportunity – Expendable bathythermographs – Delayed Mode from AODN
- GRAF Reforecast from The Weather Company
- Marginal Build Emissions Rates (MBERs) for Electricity from Climate TRACE
Geospatial
- State of Colorado Imagery from State of Colorado
- Clay Model v0 Embeddings from MadeWithClay
- Clay v1.5 NAIP-2 from MadeWithClay
- Clay v1.5 Sentinel-2 from MadeWithClay
- Ocean Biodiversity Information System (OBIS) species occurrence data from Intergovernmental Oceanographic Commission of UNESCO
- Euclid Quick Release 1 (Q1) from European Space Agency
- Canopy Tree Height Map for the Amazon Forest (mean height composite 2020-2024) by CTrees.org from CTrees
- Wildfire Projections to Support Climate Resilience from U.S. Geological Survey
- Copernicus Global Land Service – Lake Water Quality from Digital Earth Africa
- TESS-Gaia Light Curve (TGLC) from NASA
- GLAD Landsat ARD from Global Land Analysis and Discovery Lab
- SPHEREx Quick Release (QR): An All-Sky Spectral Survey from IRSA
Life sciences
- Metagenomic reference libraries for Slacken from University of Copenhagen
- Brain/MINDS Marmoset Connectivity Resource on AWS from RIKEN Center for Brain Science
- DHARANI Developing Human-Brain Atlas from Indian Institute of Science
- Human Cell Atlas from Human Cell Atlas
- MetaGraph Sequence Indexes from University of Helsinki
- CZ Grand Challenges – Imaging MIT Licensed data and models from Chan Zuckerberg Initiative
- CZ Grand Challenges – Transcriptomic MIT Licensed data and models from Chan Zuckerberg Initiative
- CZ Grand Challenges – Imaging BSD licensed data and models from Chan Zuckerberg Initiative
- CZ Grand Challenges – Model Benchmarking from Chan Zuckerberg Initiative
- Knowledge Portal Network Bottom-line Genetic Associations from Broad Institute
- CHIMERA from Radboud University Medical Center
- OpenWings OpenData from OpenWings
- Steinegger Lab Datasets from Seoul National University
- NIH Roadmap Epigenomics from NIH Roadmap Epigenomics Mapping Consortium, Ting Wang Lab at WashU
- SPARC from The SPARC Data and Resource Center
- Oceanomics from Minderoo Foundation
- EEGDash from Swartz Center for Computational Neuroscience
- AllTheBacteria from European Bioinformatics Institute
- 1KG-ONT-VIENNA panel from Institute of Molecular Pathology
Machine learning
- Indian High Court Judgments from Supreme Court of India
- AG-LOAM Dataset from ETH Zurich
- 2020 Redistricting Data File Least Squares Estimates from United States Census Bureau
How can you make your data available?
Looking to make your data available? The AWS Open Data Sponsorship Program covers the cost of storage for publicly available high-value, cloud-optimized datasets. We work with data providers who seek to:
- Democratize access to data by making it available for analysis on AWS
- Develop new cloud-native techniques, formats, and tools that lower the cost of working with data
- Encourage the development of communities that benefit from access to shared datasets
Learn how to propose your dataset to the AWS Open Data Sponsorship Program.