How Truth For Life transformed its viewer analytics while optimizing costs
Nonprofit organizations face many unique operational challenges, among them sustaining an operating budget. It’s vital that nonprofits find creative ways to promote growth and maximize the value of every dollar they spend. In this post, you’ll see how the nonprofit Bible ministry Truth For Life (TFL) built a cost-optimized viewer analytics pipeline on Amazon Web Services (AWS) to gain new insights into their audience and inform new growth strategies.
TFL’s stated mission is “to teach the Bible with clarity and relevance so that unbelievers will be converted, believers will be established, and local churches will be strengthened.” They publish daily Bible teachings on the radio, YouTube, podcasts, their mobile application, and more. Pursuing this mission means maximizing audience growth and donor engagement, and both goals share a common need: collecting and understanding as much data as possible about TFL’s viewers and donors.
For over a decade, TFL has hosted their static media assets on Amazon Simple Storage Service (Amazon S3) with server access logging enabled. For most of that time, they exported their log data to a third-party solution for analysis and visualization. This lightweight, low-effort option suited TFL’s small technical staff and tight budget. However, the solution offered little visual customization and limited ability to integrate additional data sources. It became clear that their log data could deliver far more value if TFL had greater control over how it was organized and presented. So TFL turned back to AWS to explore new data visualization possibilities, seeking greater control of their log analytics while minimizing any increase to their operating costs.
Solution overview
To unlock this untapped value in their data, TFL migrated their data visualization workload to Amazon QuickSight. The solution starts in Amazon S3, where TFL hosts their static media as objects in a dedicated bucket. On this bucket, they configured server access logging, which generates logs detailing every request made to the hosting bucket. These logs include the source bucket’s name, source object key, requester’s IP address, REST operation, request latency, and more. The logs are delivered to a separate, dedicated logging bucket, and TFL enabled date-based partitioning. With this feature enabled, the object key for each log file includes partitions based on year, month, and day as follows:
[DestinationPrefix][SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString]
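As a point of reference, here is a minimal sketch of how this logging configuration can be applied with the AWS SDK for Python (Boto3). The bucket names and prefix are hypothetical placeholders, not TFL’s actual configuration.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket names for illustration only.
MEDIA_BUCKET = "example-media-bucket"
LOG_BUCKET = "example-access-log-bucket"

# Enable server access logging on the media bucket, delivering logs to a
# separate logging bucket. PartitionedPrefix turns on date-based partitioning,
# so each log object key gains /YYYY/MM/DD/ partitions as shown above.
s3.put_bucket_logging(
    Bucket=MEDIA_BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": LOG_BUCKET,
            "TargetPrefix": "access-logs/",
            "TargetObjectKeyFormat": {
                # "DeliveryTime" partitions by log delivery time;
                # "EventTime" would partition by request time instead.
                "PartitionedPrefix": {"PartitionDateSource": "DeliveryTime"}
            },
        }
    },
)
```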
Next, TFL uses AWS Glue and Amazon Athena to index and query their logging bucket daily. QuickSight ingests the results of this query through an Athena connection. TFL enriches the dataset with supplemental information such as geographic IP mappings and social media statistics, then uses it to populate a QuickSight dashboard that TFL’s marketing and analytics teams view and explore. The architectural design is straightforward; its strength lies in careful decision-making that optimizes costs wherever possible.
The solution, shown in the following diagram, has two S3 buckets: one for media hosting and one for storing logs. AWS Glue stores metadata about the logging data, which Athena uses to perform daily SQL queries. QuickSight ingests the data from each query and appends it to a QuickSight dataset. TFL employees visualize and analyze this data with a QuickSight dashboard.
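The post doesn’t specify how TFL registers the log data with AWS Glue; one common approach is a scheduled Glue crawler that catalogs the logging bucket nightly, so the daily Athena query has fresh partition metadata. A sketch with hypothetical names:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler, IAM role, database, and bucket names for illustration.
glue.create_crawler(
    Name="access-log-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="access_logs_db",
    Targets={"S3Targets": [{"Path": "s3://example-access-log-bucket/access-logs/"}]},
    # Run nightly so newly delivered date partitions are registered
    # before the daily Athena query fires.
    Schedule="cron(0 4 * * ? *)",
)
```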
Data collection strategy
Well-strategized data collection practices are vital to cost optimization throughout this architecture. As mentioned, Amazon S3 access logs include an option for date-based partitioning, which TFL used to have log delivery partitioned by year, month, and day. In the AWS Management Console, this appears as a file hierarchy within the S3 bucket: folders by year, month folders within each year, and day folders within each month. As a result, AWS Glue and Athena can drastically reduce the amount of data parsed each day by scanning only the data that falls within a given partition, rather than the entirety of the log archive.
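To make the pruning concrete, the following sketch runs an Athena query that touches only a single day’s partition. The database, table, and partition column names are hypothetical; the full table schema for Amazon S3 server access logs is described in the Athena documentation.

```python
import boto3

athena = boto3.client("athena")

# Because the WHERE clause filters on the partition columns, Athena scans
# only the 2024/06/01 partition instead of the entire log archive.
# All identifiers here are hypothetical.
QUERY = """
    SELECT remoteip, key, operation, bytessent
    FROM access_logs_db.media_access_logs
    WHERE log_year = '2024' AND log_month = '06' AND log_day = '01'
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "access_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```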
Preparing and ingesting data into Amazon QuickSight
After collecting the server access logs, TFL needed to load their data into QuickSight while considering the implications of ingesting a steadily growing, terabyte-scale dataset. First, the more data a given refresh scans, the more that refresh costs, whether as Amazon S3 API charges, Athena scanning charges, or otherwise, depending on the chosen data source. Second, the larger a dataset becomes, the greater the chance of errors and timeouts. Together, these two factors made performing a complete refresh of the dataset impractical. To address this, TFL chose Athena as their QuickSight data source because it can ingest data through custom SQL queries (a sketch of such a dataset definition follows the list below). This method has the following added benefits:
- Ability to limit the amount of data selected for ingestion
- Incremental refreshes to QuickSight datasets
- Ability to manipulate and join datasets with SQL
- Dynamic logic for data ingestion (that is, querying data from a rolling window based on the current date)
- Easier adjustments compared to alternatives such as using an Amazon S3 manifest file
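Here is a minimal sketch of what such a dataset definition could look like, using a custom SQL query with a rolling 7-day window. The account ID, ARNs, identifiers, column names, and window length are all hypothetical placeholders.

```python
import boto3

quicksight = boto3.client("quicksight")

# Hypothetical rolling-window query: each run scans only the last 7 days
# of logs, keyed off the current date.
CUSTOM_SQL = """
    SELECT log_date, remoteip, key, operation, bytessent
    FROM access_logs_db.media_access_logs
    WHERE log_date >= date_add('day', -7, current_date)
"""

quicksight.create_data_set(
    AwsAccountId="123456789012",
    DataSetId="access-log-dataset",
    Name="Media access logs",
    ImportMode="SPICE",  # Query results are loaded into SPICE.
    PhysicalTableMap={
        "access-logs": {
            "CustomSql": {
                "DataSourceArn": (
                    "arn:aws:quicksight:us-east-1:123456789012"
                    ":datasource/athena-access-logs"
                ),
                "Name": "access-logs-rolling-window",
                "SqlQuery": CUSTOM_SQL,
                "Columns": [
                    {"Name": "log_date", "Type": "DATETIME"},
                    {"Name": "remoteip", "Type": "STRING"},
                    {"Name": "key", "Type": "STRING"},
                    {"Name": "operation", "Type": "STRING"},
                    {"Name": "bytessent", "Type": "INTEGER"},
                ],
            }
        }
    },
)
```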
When defining a new data source in QuickSight, you can configure a refresh schedule to reload your dataset with any updates to the underlying data source. In this solution, TFL configured incremental refreshes, which are only possible with a SQL-based data source. In an incremental refresh, QuickSight runs a predefined SQL query that scans data from a look-back window (for example, the past 7 days). QuickSight then deletes any preexisting SPICE (Super-fast, Parallel, In-memory Calculation Engine) data from that window and replaces it with the newly scanned information. This solves TFL’s refresh timeout problem: all new data and updates to recent data are captured, while the older majority of the data is left alone and never rescanned.
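Incremental refresh is configured through the dataset’s refresh properties, which define the look-back window. A minimal sketch with the same hypothetical identifiers:

```python
import boto3

quicksight = boto3.client("quicksight")

# Configure a 7-day look-back window on a date column. On each incremental
# refresh, QuickSight re-runs the dataset query, drops the SPICE data that
# falls inside the window, and replaces it with the newly scanned rows.
quicksight.put_data_set_refresh_properties(
    AwsAccountId="123456789012",
    DataSetId="access-log-dataset",
    DataSetRefreshProperties={
        "RefreshConfiguration": {
            "IncrementalRefresh": {
                "LookbackWindow": {
                    "ColumnName": "log_date",  # Date column in the dataset
                    "Size": 7,
                    "SizeUnit": "DAY",
                }
            }
        }
    },
)
```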
Conclusion
TFL realized benefits from their new media analytics pipeline soon after completing it. Thanks to the highly customizable data curation abilities of QuickSight, TFL joined their base data with geographic IP-mapping data to learn where their viewers are located. This revealed that their media is viewed in over 220 distinct countries and territories (such as Puerto Rico) across the world, a massive leap beyond what they originally believed their reach to be. Discoveries like these have inspired TFL to develop new marketing and social outreach efforts with a global scope. TFL’s success with QuickSight is a compelling example for businesses and organizations of any type and size that want to use their data in new ways to further their mission.
Interested in learning more about QuickSight? See the official QuickSight documentation and check out Getting Started on the QuickSight Community website.