Tracking Pixel driven web analytics with AWS Edge Services: Part 1
Being able to analyze web traffic and user behavior is essential to understanding the impacts of new features, content updates, or current product iterations for websites and applications. Tracking website activity can provide insight into who visits your website, where they come from, and what content they view. A web beacon is a common technique used to track user behavior. By embedding a small piece of HTML, a company can increase visibility into how users interact with its products. Common metrics tracked by web beacons are events-per-hour, visitor count, user agents, abnormal events, aggregate event count, referrers, and recent events. Although there are many types of HTML elements that can be used as beacons, images were the first web beacons used and are the focus of this post.
A tracking pixel is a 1×1 pixel image whose loading request is used to send tracking information to a backend server. Instead of using a traditional JavaScript API call, the information is sent in the query parameters of the image GET request and in the HTTP headers themselves, which makes it possible to include the pixel in any component that supports HTML, such as a webpage or even an email.
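As a minimal sketch of how tracking data can ride along on the image request, the following Python snippet builds a pixel URL and the matching image element. The CloudFront domain, the object name pixel.png, and the parameter names (event, page, campaign) are hypothetical and chosen only for illustration.

```python
import html
from urllib.parse import urlencode

# Hypothetical distribution domain and object key, for illustration only.
PIXEL_BASE_URL = "https://d111111abcdef8.cloudfront.net/pixel.png"


def build_pixel_url(event: str, page: str, campaign: str) -> str:
    """Return a pixel URL whose query string carries the tracking data."""
    # Parameter names are assumptions; use whatever your analysis needs.
    params = {"event": event, "page": page, "campaign": campaign}
    return f"{PIXEL_BASE_URL}?{urlencode(params)}"


def build_img_tag(url: str) -> str:
    """Wrap the pixel URL in a 1x1 image element for a page or an email."""
    # html.escape turns "&" into "&amp;" so the attribute is valid HTML.
    return f'<img src="{html.escape(url)}" width="1" height="1" alt="">'


if __name__ == "__main__":
    url = build_pixel_url("page_view", "/pricing", "spring sale")
    print(url)
    print(build_img_tag(url))
```

Because the data lives entirely in the URL, the same element works anywhere an image can be rendered, with no script execution required.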
We covered building a tracking pixel in our post Build a serverless tracking pixel solution in AWS. That post covers how to use a serverless architecture built around Amazon API Gateway and AWS Lambda. If you are familiar with serverless architectures built with API Gateway, then we recommend looking at that solution. This post focuses on building a tracking pixel using AWS edge services with Amazon CloudFront. This offers an easier lift if you are already using CloudFront as your content delivery network (CDN). Customers using CloudFront can add this capability to their existing CloudFront distribution and minimize the introduction of additional services to their current architecture.
Architecture overview
The architecture for the pixel tracking solution starts with CloudFront, which is a CDN service built for high performance and security. A 1×1 pixel image is stored in an Amazon Simple Storage Service (Amazon S3) bucket. CloudFront caches and serves the pixel image from the S3 bucket. When creating a CloudFront distribution, you specify an origin to which CloudFront sends requests for files. A distribution can use several different kinds of origins. With Amazon S3 as the origin, CloudFront delivers the files stored in the S3 bucket. When a request is made to fetch the pixel image, the tracking data is passed with the request. The information on this request is captured by CloudFront real-time logs. You can use real-time logs to monitor, analyze, and take action on requests made to a distribution. To configure real-time logs, specify the fields to log, such as c-ip, timestamp, and cs-uri-query, and set the sampling rate, which controls the percentage of viewer requests for which log records are delivered. For a pixel tracking use case, set the sampling rate to 100% so that every viewer request is included. Then, attach the real-time log configuration to the distribution’s cache behavior.
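To make the logged fields more concrete, here is a rough sketch of how a downstream consumer might parse one real-time log record. It assumes that a record is a single tab-delimited line whose values follow the field order of the real-time log configuration, and that an empty query string is logged as a hyphen. The field order, the sample values, and the function name are illustrative assumptions.

```python
from urllib.parse import parse_qs

# Assumed field order; it has to match the order chosen in the
# real-time log configuration. The names come from the text above.
LOG_FIELDS = ["timestamp", "c-ip", "cs-uri-query"]


def parse_log_record(line: str) -> dict:
    """Split one tab-delimited log record into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(LOG_FIELDS):
        raise ValueError(f"expected {len(LOG_FIELDS)} fields, got {len(values)}")
    record = dict(zip(LOG_FIELDS, values))
    # Assumption: "-" means the request carried no query string.
    query = record["cs-uri-query"]
    if query == "-":
        record["tracking"] = {}
    else:
        # Keep the first value of each parameter and URL-decode it.
        record["tracking"] = {k: v[0] for k, v in parse_qs(query).items()}
    return record


if __name__ == "__main__":
    sample = "1700000000.123\t203.0.113.10\tevent=page_view&page=%2Fpricing"
    print(parse_log_record(sample))
```

Checking the field count up front surfaces a mismatch between the parser and the log configuration early, instead of silently mislabeling columns.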
With real-time logs, you can customize the information collected and where it gets delivered. The real-time logs are integrated with Amazon Kinesis Data Streams to enable delivery of these logs to a generic HTTP endpoint using Amazon Kinesis Data Firehose. Amazon Kinesis Data Streams allows for ingestion of large streams of data in a robust way. Kinesis collects the CloudFront real-time logs, allowing for further processing. A Kinesis data stream is a set of shards that stores records for 24 hours by default, and for up to 365 days if configured. Each shard ingests up to 1 MB per second and 1,000 records per second, and emits up to 2 MB per second. For example, 30 shards would ingest at peak 30 MB per second and 30,000 records per second, and emit up to 60 MB per second. The Kinesis data stream is configured for one consumer, Kinesis Data Firehose.
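The shard arithmetic can be sketched as a small sizing helper. It assumes the per-shard write quotas documented for provisioned mode (1 MB per second or 1,000 records per second, whichever is reached first); the function name and the traffic figures in the example are hypothetical.

```python
import math

# Assumed per-shard write quotas for a provisioned-mode data stream.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000


def shards_needed(requests_per_second: float, avg_record_kb: float) -> int:
    """Estimate the shard count for a given pixel request rate."""
    # Shards required by the records-per-second quota.
    by_records = requests_per_second / WRITE_RECORDS_PER_SHARD
    # Shards required by the MB-per-second quota.
    write_mb_per_second = requests_per_second * avg_record_kb / 1024
    by_bytes = write_mb_per_second / WRITE_MB_PER_SHARD
    # The tighter of the two quotas decides; always keep at least one shard.
    return max(1, math.ceil(max(by_records, by_bytes)))


if __name__ == "__main__":
    # Small log records: the record-count quota is the limiting factor.
    print(shards_needed(25_000, 0.5))
    # Larger log records: the byte quota becomes the limiting factor.
    print(shards_needed(25_000, 2))
```

Logging only the fields you need keeps records small, which in turn keeps the byte quota from driving the shard count.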
Amazon Kinesis Data Firehose is a streaming ETL service that offers the easiest way to load streaming data into data stores and analytics tools. The streaming information is buffered for up to 15 minutes to consolidate it into fewer files to store in the data lake. This helps minimize the costs of storage and future queries. Data lakes help break down data silos to maximize end-to-end data insights. Amazon S3 is the best place to build data lakes because of its unmatched durability, availability, scalability, security, compliance, and audit capabilities. A data catalog provides the search and query capabilities for the data stored in the data lake. AWS Glue is a data integration service that makes it easier to discover, prepare, move, and integrate data. One of the components of Glue is the Glue Data Catalog, which is a central metadata repository. Glue crawlers are used to automatically create the schema tables of the catalog. Amazon Athena is integrated out of the box with Data Catalog and is used to analyze the cataloged tracking data. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.
Cost optimization
CloudFront origin fetches from any AWS origin incur no data transfer out (DTO) charges. Using Amazon S3 to host the pixel image ensures no charges are incurred when fetching the image. A further improvement is to add aggregation. Aggregation refers to the storage of multiple records in a single Kinesis data stream record, which increases the number of records sent per API call and increases producer throughput. This is a cost optimization because, at the time of this publication, Kinesis Data Firehose ingestion pricing is tiered and billed per GB ingested in 5 KB increments. You can use the Kinesis Producer Library (KPL) to write data to a Kinesis data stream with aggregation. KPL lets you write to one or more Kinesis data streams with automatic and configurable retry logic, and it aggregates user records to increase payload size and throughput. When this data stream is the source of a Kinesis Data Firehose delivery stream, Kinesis Data Firehose de-aggregates the records before it delivers them to the destination.
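A short calculation illustrates why aggregation matters under increment-based billing. This sketch assumes that each ingested record is rounded up to the next 5 KB and that 1 KB is 1,024 bytes. It ignores the small overhead that aggregation adds, and the record sizes are made up for the example.

```python
import math

# Assumption: 5 KB increment, with 1 KB taken as 1,024 bytes.
BILLING_INCREMENT_BYTES = 5 * 1024


def billed_bytes(record_sizes: list[int]) -> int:
    """Sum record sizes after rounding each one up to the next 5 KB."""
    return sum(
        math.ceil(size / BILLING_INCREMENT_BYTES) * BILLING_INCREMENT_BYTES
        for size in record_sizes
    )


if __name__ == "__main__":
    # Ten hypothetical 500-byte log records sent individually.
    unaggregated = billed_bytes([500] * 10)
    # The same ten records combined into one 5,000-byte record.
    aggregated = billed_bytes([10 * 500])
    print(unaggregated, aggregated)
```

In this example the ten separate records are billed as 51,200 bytes, while the single aggregated record is billed as 5,120 bytes, so small records benefit the most from aggregation.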
KPL is currently only available as a Java API wrapper around a C++ executable, which may not be suitable for all deployment environments. As an alternative, the Amazon Kinesis aggregation and de-aggregation modules support Java, Python, and Node.js, and they let you publish aggregated user records using Lambda. CloudFront real-time logs arrive in the Kinesis data stream unaggregated. Lambda integrates natively with Kinesis Data Streams: the stream invokes the Lambda function, which uses the aggregation module to combine the incoming Kinesis records. The Lambda function reads from the stream starting at the LATEST position, processes incoming records, and combines them into an aggregated record. The function can be configured to produce aggregated records of roughly 5 KB each. It then publishes them to another Kinesis data stream, where the data is compatible with consumers that use the Kinesis Client Library (KCL) or the de-aggregation modules.
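The size-based grouping step inside such a Lambda function might look roughly like the following. This is a simplified sketch of the batching logic only: it does not produce the KPL aggregated record format, which in practice the aggregation module would handle, and the function name, the 5 KB target, and the sample payloads are assumptions.

```python
# Target size for one aggregated record (assumed 5 KB of 1,024 bytes each).
MAX_BATCH_BYTES = 5 * 1024


def batch_records(
    payloads: list[bytes], max_bytes: int = MAX_BATCH_BYTES
) -> list[list[bytes]]:
    """Group payloads, in order, into batches of at most max_bytes."""
    batches: list[list[bytes]] = []
    current: list[bytes] = []
    current_size = 0
    for payload in payloads:
        # Close the current batch if this payload would overflow it.
        if current and current_size + len(payload) > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        # An oversized payload still goes out, alone in its own batch.
        current.append(payload)
        current_size += len(payload)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Six hypothetical 2,000-byte log records from one Lambda invocation.
    records = [b"x" * 2000 for _ in range(6)]
    for batch in batch_records(records):
        print(len(batch), sum(len(p) for p in batch))
```

Each resulting batch would then be encoded by the aggregation module and published to the second data stream as one record.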
Potential use cases
Advertising Technology (Ad Tech)
Ad tech exchange servers track billable beacons through a combination of data collection, tracking mechanisms, and reporting systems. CloudFront real-time logs can help you monitor and analyze the incoming beacons in real time, providing valuable insights into ad performance and enabling prompt decision-making. Ad tech platforms collect data from various sources, such as publishers, advertisers, and third-party providers, including user behavior, ad impressions, clicks, conversions, and other relevant metrics. Tracking mechanisms like cookies, pixels, tags, or JavaScript code embedded within web pages or mobile apps are used to monitor user interactions with ads. When a user interacts with an ad, a beacon is sent to the ad tech exchange server.
Personal blog
A personal blog would leverage pixel tracking in the way demonstrated in this post for similar reasons to ad tech, but at a smaller scale and more personal level. A person could track activities across particular blog topics or have a deeper understanding of their viewers’ geographic locations to drive their content selection. If more of their users are consuming their content on a mobile platform, then perhaps it makes more sense to tailor blogging toward more short-form content to be easily digested on-the-go. On the smaller scale of a personal blog, each impression is inherently more valuable than impressions at scale, and using pixel tracking to squeeze every insight possible out of these impressions can be a differentiator for growth.
E-Commerce
With the ultimate objective of getting a user to make a purchase, e-commerce platforms that better understand the online patterns and habits of their users can drive increased sales. Is the front page failing to recommend pertinent items, causing users to end their browsing session? Are users abandoning their cart at a specific step in the sales cycle? Answering these questions by analyzing the information collected through tracking pixels can create a more integrated and streamlined online shopping experience for customers.
Conclusion
In this post, we showed how you can replace web beacon servers with AWS edge services built around CloudFront. AWS edge services improve performance by moving compute, data processing, and storage closer to end-user devices. The solution pairs CloudFront with AWS analytics services (Kinesis Data Streams, Kinesis Data Firehose, Amazon S3, Glue Data Catalog, and Athena) to provide insight into website activity. Its use of managed and serverless services gives it an advantage over a traditional beacon server by providing automatic scaling and cost savings through a pay-for-use billing model. If you are interested in learning other ways to customize at the edge, then check out our documentation on customizing with edge functions.
In Part 2 of this series, we demonstrate how to implement the pixel tracking solution with CloudFront real-time logs.




