AWS for Industries

GenAI in Factor Modeling Data Pipelines: A Hedge Fund Workflow on AWS

Introduction

Factor modeling in hedge funds is a quantitative approach that identifies and analyzes key drivers of asset returns. It enables fund managers to optimize portfolios, manage risk, and generate alpha by leveraging vast amounts of market data. This technique facilitates the development of sophisticated trading strategies, enhancing investment performance and providing a competitive edge in the financial markets.

This post explores how integrating AWS serverless patterns and GenAI services creates a robust factor modeling pipeline that addresses the challenges outlined below. We delve into the technical aspects of workflow implementation, providing GitHub code samples you can quickly deploy or modify to find the right factors. Our target audience includes quant developers seeking to enhance their firm’s computational capabilities and portfolio managers looking to leverage alternative data for alpha generation.

The classic book Quantitative Equity Portfolio Management: An Active Approach to Portfolio Construction and Management highlights the challenges in factor modeling. Manual factor identification and calculation across thousands of securities is not only time-consuming but also prone to errors and constrained by computational limitations. As datasets expand to include alternative sources, like market news and unstructured financial documents, scalability becomes a critical issue.

By leveraging cloud services to build a factor modeling platform, investment firms can streamline back-testing processes, extract nuanced signals from textual data, and adapt swiftly to changing market conditions. This approach enables quant teams to focus on model development rather than infrastructure management, facilitating faster iteration and deployment of investment strategies. Ultimately, this modern, cloud-native factor modeling platform empowers financial professionals to make informed, data-driven investment decisions and enhance portfolio performance.

Solution overview

Our solution presents an end-to-end data processing application for quantitative finance factor modeling. This architecture helps hedge funds and quantitative analysts identify and quantify underlying drivers of asset returns through a combination of financial data and social media sentiment analysis. When the automated processing steps complete, they produce the output factors for portfolio construction, risk management, and trading strategy development.

The following diagram illustrates the architecture and workflow of the proposed solution:

Figure 1 Factor modeling data pipeline and factor mining reference architecture

The following sections highlight the key components.

1 Serverless data collection

1.1 Yahoo Finance market data collection
Hedge funds rely heavily on market data for their trading strategies and risk management. The AWS Lambda function daily tick data uses Yahoo Finance to download daily Open, High, Low, Close, and Volume (OHLCV) data. Adjust this function to use your own market data vendor.

There are cases where market data vendors and brokers require clients to provide static IP addresses for their allow-lists. NAT Gateways provide a consistent static source IP address for outbound traffic to meet vendors’ static IP requirements.
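
For illustration, here is a minimal sketch of such a daily tick data handler built on yfinance; the event shape and return value are assumptions for this example, not the repository's actual code:

import yfinance as yf

def lambda_handler(event, context):
    # Illustrative only: download one day of OHLCV data for each requested ticker
    tickers = event.get("tickers", ["AAPL", "MSFT"])   # assumed event shape
    collected = {}
    for ticker in tickers:
        # history() returns a pandas DataFrame with Open, High, Low, Close, and Volume columns
        df = yf.Ticker(ticker).history(period="1d", interval="1d")
        if not df.empty:
            collected[ticker] = df
    # In the deployed pipeline these rows would be written to the OLAP store described in section 2
    return {"statusCode": 200, "tickers_processed": list(collected.keys())}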

1.2 Web search by Tavily
Hedge funds are embracing alternative data and GenAI to gain a competitive edge. GenAI’s advanced text processing capabilities make it ideal for analyzing diverse, unstructured data sources. This powerful combination empowers funds to uncover hidden patterns, anticipate market trends, and make informed investment decisions, potentially leading to superior returns.

Tavily offers AI-powered web search, allowing for targeted retrieval of news, analyst reports, and other web content relevant to factor modeling. The function web search uses Tavily’s API to search for news related to the stocks. Once the news is retrieved, the framework uses the following prompt to generate sentiment scores for the stock market news:

Analyze the sentiment of the following text about a company's stock and financial performance.

Rate the sentiment on a scale from -1 to 1, where:
- -1 represents extremely negative sentiment
- 0 represents neutral sentiment
- 1 represents extremely positive sentiment

Only respond with a single number between -1 and 1, with up to two decimal places. No explanation is needed.

Text to analyze:
{text}

You can engineer the prompt based on your factor requirements.
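
To make the flow concrete, here is a hedged sketch of fetching news with the Tavily Python client and scoring it with Amazon Bedrock's Converse API; the model ID, query, and API key are placeholders rather than the framework's actual configuration:

import boto3
from tavily import TavilyClient

tavily = TavilyClient(api_key="YOUR_TAVILY_API_KEY")   # placeholder key
bedrock = boto3.client("bedrock-runtime")

# Shortened version of the sentiment prompt shown above
SENTIMENT_PROMPT = (
    "Analyze the sentiment of the following text about a company's stock and financial "
    "performance. Rate the sentiment on a scale from -1 to 1 and respond with a single "
    "number only.\n\nText to analyze:\n{text}"
)

def score_news_sentiment(ticker, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    # Retrieve recent news snippets for the ticker
    search = tavily.search(query=f"{ticker} stock news")
    text = " ".join(result["content"] for result in search.get("results", []))

    # Ask the model for a single sentiment score through the Bedrock Converse API
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": SENTIMENT_PROMPT.format(text=text)}]}],
    )
    return float(response["output"]["message"]["content"][0]["text"].strip())

The returned score can then be stored as a sentiment factor value, for example to feed a factor such as AVGSENT14.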

1.3 SEC filing retrieval
SEC filings are crucial documents that public companies must submit to the U.S. Securities and Exchange Commission (SEC). Two key filings are the 10-K annual report, which provides a comprehensive overview of the company’s financial condition, including audited financial statements, and the 10-Q quarterly report, which contains unaudited financial statements and operational updates. These filings contain valuable data for factor modeling, such as financial ratios and revenue breakdowns.

The Lambda function fetch SEC uses SEC’s EDGAR API to download SEC filings in JSON format. You can schedule these serverless functions to run periodically, fetching the latest filings for companies of interest. AWS Lambda’s ability to scale automatically makes it ideal for handling varying loads, especially during peak filing seasons.
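
For reference, the EDGAR submissions endpoint is a plain HTTPS API, and SEC requires a descriptive User-Agent header identifying the caller. A minimal sketch (the CIK below is Apple's, used purely as an example):

import requests

# SEC requires a descriptive User-Agent that identifies the caller
HEADERS = {"User-Agent": "your-name your.email@example.com"}

def fetch_submissions(cik):
    """Fetch a company's filing index (10-K, 10-Q, ...) from EDGAR as JSON."""
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

filings = fetch_submissions("320193")                       # Apple Inc.
recent = filings["filings"]["recent"]
print(list(zip(recent["form"], recent["filingDate"]))[:5])  # most recent filings and dates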

1.4 Financial report processing
Some unstructured data points, such as CEO statements, ESG initiatives, and strategic priorities, are not included in the EDGAR API response. You can upload financial report PDF files containing that information to Amazon Simple Storage Service (Amazon S3). When you upload a file, S3 event notifications trigger the Lambda function financial report processor, which uses a prompt to extract the unstructured data from the report. For example:

As an experienced CFA and FRM holder, please analyze the attached annual report and extract concise summaries (2-3 sentences each) for the following key factors:

1. CEO statement - Focus on strategic vision, major achievements, and forward-looking statements
2. ESG initiatives - Highlight environmental sustainability efforts, social responsibility programs, and governance improvements
3. Market trends and competitive landscape - Identify industry shifts, market position changes, and competitive advantages/challenges
4. Risk factors - Extract the most significant financial, operational, and strategic risks facing the company
5. Strategic priorities - Summarize key growth initiatives, investment areas, and long-term business objectives

For each factor, provide:
- The most important points using specific data when available
- Any significant changes from the previous year
- A performance rating (0-10 scale) with brief justification based on industry benchmarks and year-over-year progress

Here is the financial report text:
<report>
{text[:100000]}  
</report>

Please output in the following JSON format:
{{
"items": [
{{
"category": "CEO statement",
"summary": "...",
"key_data": ["...", "..."],
"year_over_year_change": "...",
"rating": 8,
"rating_justification": "..."
}},
...
],
"overall_assessment": "...",
"investment_recommendation": "..."
}}
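
Behind this prompt, the S3-triggered handler only needs to read the bucket and key from the event, recover the ticker and report date from the object key convention described later (ticker/year/YYYYMMDD.pdf), extract the PDF text, and call Amazon Bedrock in the same way as in section 1.2. A minimal sketch, with all names assumed and the PDF text extraction left out:

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]            # e.g. "AMZN/2025/20250501.pdf"
        ticker, _year, filename = key.split("/")
        report_date = filename.removesuffix(".pdf")    # "20250501"

        pdf_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Extract text from pdf_bytes (for example with a PDF library packaged in a Lambda layer),
        # insert it into the prompt above, and send it to Amazon Bedrock following the same
        # pattern as the sentiment call in section 1.2.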

2 Data storage with OLAP database

Our reference implementation uses the ClickHouse columnar database for storing factor modeling values and results. This type of database is optimized for analytical workloads and handles large volumes of structured data. However, the data layer can use different technologies based on your specific requirements and existing infrastructure, such as Amazon Redshift or Amazon SageMaker Lakehouse.

This framework contains four main tables: factor_details, factor_summary, factor_timeseries, and factor_values. Here’s a description of their design for factor modeling, followed by a short sketch of writing to one of them:

  1. factor_values table stores individual factor values for specific tickers and dates with columns: Factor identification (name, type), Ticker, Datetime, Value.
  2. factor_timeseries table stores factor values and associated portfolio returns over time with columns: Factor identification (name, type), Datetime, Factor value, High and low portfolio returns.
  3. factor_details table stores detailed information about individual factors for specific tickers with columns: Factor identification (name, type), Ticker, Statistical measures (beta, t-statistic, p-value, r-squared), Confidence intervals.
  4. factor_summary table summarizes factor performance over a period. Its columns are: Factor identification (name, type), Date range (start_date, end_date), Statistical measures (avg_beta, avg_tstat, avg_rsquared), Stock counts (significant_stocks, total_stocks), and Performance metrics (annualized_return, annualized_volatility, sharpe_ratio, max_drawdown).
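
As a sketch of how a job might persist computed values into the factor_values table with the clickhouse-connect client; the host, credentials, and column names below are assumptions based on the descriptions above, not the exact deployed schema:

from datetime import datetime
import clickhouse_connect

# Connection details are placeholders; in this framework ClickHouse runs in a private subnet
client = clickhouse_connect.get_client(host="clickhouse.internal", username="default", password="...")

rows = [
    ["AVGSENT14", "sentiment", "AMZN", datetime(2025, 4, 30), 0.42],
    ["AVGSENT14", "sentiment", "AAPL", datetime(2025, 4, 30), -0.10],
]
client.insert(
    "factor_values",
    rows,
    column_names=["factor_name", "factor_type", "ticker", "datetime", "value"],
)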

3 Factor mining with parallel computing

Identifying and calculating factors across thousands of securities is time-consuming and constrained by on-premises computational limitations. AWS Batch and AWS Step Functions create a powerful parallel computing solution for factor modeling, orchestrating complex workflows efficiently.

3.1 AWS Batch
Factor mining involves three steps:

1) Calculating daily factor values for stocks and forming portfolios based on factor rankings.
2) Regression to calculate stock-specific factor betas.
3) Assessing factor predictive power and robustness by t-statistics and R-squared.

AWS Batch excels at managing compute environments, job queues, and job definitions. It automatically provisions compute resources for factor mining. You can run thousands of parallel AWS Batch tasks, sized by the number of tickers and the back-testing date ranges you need.
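
In this framework the fan-out is orchestrated by Step Functions (section 3.2), but the underlying operation is AWS Batch's SubmitJob. A hedged sketch of submitting one job per ticker group, with the queue name, job definition name, and environment variables assumed for illustration:

import boto3

batch = boto3.client("batch")

def submit_factor_jobs(ticker_groups, start_date, end_date, factor):
    """Submit one AWS Batch job per group of tickers and return the job IDs."""
    job_ids = []
    for i, tickers in enumerate(ticker_groups):
        response = batch.submit_job(
            jobName=f"factor-mining-{factor}-{i}",
            jobQueue="fm-factor-mining-queue",        # assumed queue name
            jobDefinition="fm-factor-mining-job",     # assumed job definition name
            containerOverrides={
                "environment": [
                    {"name": "TICKERS", "value": ",".join(tickers)},
                    {"name": "START_DATE", "value": start_date},
                    {"name": "END_DATE", "value": end_date},
                    {"name": "FACTOR", "value": factor},
                ]
            },
        )
        job_ids.append(response["jobId"])
    return job_ids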

3.2 AWS Step Functions
Step Functions complements AWS Batch by providing a visual workflow to orchestrate parallel batch processes. It allows you to define the sequence of operations, manage state transitions, and handle errors gracefully.

Step Functions starts parallel AWS Batch jobs for each layer in factor mining:

  1. Factor: Calculate different factors in parallel.
  2. Ticker: Analyze multiple stock tickers concurrently.
  3. Date: Process historical data for various time ranges simultaneously.

4 Visualization of mining results with Streamlit

This framework features a visualization dashboard built with Streamlit that displays factor effectiveness. The dashboard provides a clear view of the mining process and its results, supporting factor selection for future trading strategy development.
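
A stripped-down sketch of such a Streamlit page, with the host and the factor_summary column names assumed to match the descriptions in section 2:

import clickhouse_connect
import streamlit as st

st.title("Factor Mining Results")

# Placeholder host; point this at your ClickHouse endpoint
client = clickhouse_connect.get_client(host="clickhouse.internal")
df = client.query_df(
    "SELECT factor_name, avg_tstat, avg_rsquared, sharpe_ratio FROM factor_summary"
)

st.dataframe(df)                                           # raw summary table
st.bar_chart(df.set_index("factor_name")["sharpe_ratio"])  # compare Sharpe ratios across factors

You would launch the page with streamlit run app.py from a host that can reach the database.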

Using the deployed solution

This section explains how to trigger and interact with the key components of our deployed system. You’ll find instructions for running Lambda functions (both scheduled and manual), retrieving market data, processing financial reports through S3 uploads, and running factor mining workflows via AWS Step Functions. These guidelines will help you effectively use the system’s capabilities for financial data extraction, sentiment analysis, and factor mining, whether you need immediate results or scheduled automated processes.

1 Automatic events
1.1 For the Lambda functions sec-data-lambda and stock_news_fetcher
An Amazon EventBridge event scheduler triggers these two Lambda functions. They operate on a predefined schedule to extract data periodically. If you want to change the data extraction frequency, please edit daily-sec-filing-trigger and daily-stock-news-trigger rules in the Amazon EventBridge console.

Figure 2 Using Amazon EventBridge to trigger your Lambda functions automatically

To manually run the Lambda functions to see the data extraction results, open the AWS Lambda console, select the function, navigate to the “Test” tab, and select the “Test” button. This will immediately run the Lambda function.

Figure 3 Run AWS Lambda by the Test function
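
You can also invoke these functions programmatically instead of through the console; a short boto3 sketch using the function names above:

import json
import boto3

lambda_client = boto3.client("lambda")
response = lambda_client.invoke(
    FunctionName="stock_news_fetcher",     # or "sec-data-lambda"
    InvocationType="RequestResponse",      # wait for the result
    Payload=json.dumps({}),
)
print(json.loads(response["Payload"].read()))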

1.2 For the Lambda function market-data-collector

This function retrieves the historical market data using the yfinance API (Yahoo Finance). Run the function manually by selecting the Test button in the AWS Lambda console. Likewise, to retrieve the daily market data update automatically, use an Amazon EventBridge daily schedule to trigger this function during off-market hours.

Please note that yfinance is a free API with rate limitations that might cause occasional failures. During testing, expect intermittent timeouts or incomplete data retrieval due to these constraints.
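
One simple mitigation during testing is to wrap the yfinance call in a retry with exponential backoff; a minimal sketch:

import time
import yfinance as yf

def download_with_retry(ticker, retries=3, backoff=2.0):
    """Retry the yfinance download to smooth over transient rate-limit failures."""
    for attempt in range(retries):
        df = yf.Ticker(ticker).history(period="1d")
        if not df.empty:
            return df
        time.sleep(backoff ** attempt)   # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError(f"No data returned for {ticker} after {retries} attempts")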

For production environments, we strongly recommend using commercial, stable third-party APIs for market data acquisition to ensure reliability and consistent performance.

2 Upload financial reports to S3

To trigger the financial_report_processor Lambda function, use the following steps:

  1. Obtain a financial report in PDF format, for example Amazon’s 2024 annual report.
  2. Rename the PDF so that its name shows the report date, and upload it to the S3 bucket (for example financial-reports-bucket) following this path structure:

financial-reports-bucket/ticker/year/YYYYMMDD.pdf

For example,

financial-reports-bucket/AMZN/2025/20250501.pdf

After uploading the file, the Lambda function financial_report_processor will automatically run to perform GenAI sentiment analysis as explained in section 1.4, “Financial Report Processing”.
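
Equivalently, you can upload the report programmatically; a short boto3 sketch using the example key above (replace the bucket name with the one deployed in your account):

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="amazon-2024-annual-report.pdf",   # local file name (assumed)
    Bucket="financial-reports-bucket",          # your deployed bucket name
    Key="AMZN/2025/20250501.pdf",               # ticker/year/YYYYMMDD.pdf
)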

3 Trigger the step function

An AWS Step Functions state machine performs the core factor mining computation. Open the AWS Step Functions console and you’ll find a state machine named fm-factor-mining. Select the Edit button to view the detailed workflow.

Figure 4 Illustrative diagram of the AWS Step Functions workflow for factor mining

This workflow integrates various Lambda functions and Batch Jobs, operating through the following states:

  1. ProcessDateTickerThread: This state invokes the Lambda function process_date_ticker_thread to divide the calculation window based on the input thread_no. For example, if the calculation window is 5 years and thread_no=5, the window will be split into five 1-year segments for parallel processing.
  2. ParallelBatchProcessing: This Map state executes AWS Batch jobs in parallel. These Batch jobs (with batch_no=1) calculate and construct factors, including retrieving stock data, computing factor values, and storing stock returns and portfolio returns in the ClickHouse database.
  3. PrepareTickerProcessing: This state calls the Lambda function process_date_ticker_thread again, but instead of dividing by time periods, it groups all tickers into different batches based on the parallel_m parameter.
  4. TickerBatchProcessing: After grouping the tickers, this Map state invokes AWS Batch jobs in parallel. These Batch jobs (with batch_no=2) test the factors by retrieving stock returns and portfolio returns from the database, performing factor test analysis, and storing the test results back to the database.
  5. SummarizationBatch: This state triggers Batch jobs (with batch_no=3) to evaluate factor performance. It retrieves portfolio returns and factor test results from the database, conducts portfolio evaluation, and stores the summarized results in the factor_summary table.
  6. ProcessResults: The final state invokes the Lambda function process_batch_results to display the final processing results of the Batch jobs.

To run the fm-factor-mining step function, copy the following input data in JSON format, paste it into the input box, and select the Start execution button.

{
  "start_date": "2025-01-01",
  "end_date": "2025-04-30",
  "tickers": "AAPL,AMGN,AMZN,AXP,BA,CAT,CRM,CSCO,CVX,DIS,GS,HD,HON,IBM,JNJ,JPM,KO,MCD,MMM,MRK,MSFT,NKE,NVDA,PG,SHW,TRV,UNH,V,VZ,WMT",
  "thread_no": 6,
  "parallel_m": 6,
  "factor": "AVGSENT14"
}
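
Alternatively, you can start the same execution from code; a boto3 sketch in which the state machine ARN is a placeholder for the one deployed in your account:

import json
import boto3

sfn = boto3.client("stepfunctions")
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:fm-factor-mining",  # placeholder ARN
    input=json.dumps({
        "start_date": "2025-01-01",
        "end_date": "2025-04-30",
        "tickers": "AAPL,AMZN,MSFT",
        "thread_no": 6,
        "parallel_m": 6,
        "factor": "AVGSENT14",
    }),
)
print(execution["executionArn"])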

4 Visualization

We generated the following factor mining results after running the same factors over different date ranges on the DJIA 30.

T-stat measures the statistical significance of factors. It’s crucial in factor modeling to determine which factors reliably predict returns. The following chart displays T-statistic distributions for various factor types.

Figure 5 T-statistic distributions for factor types

R-squared measures the proportion of variance in stock returns explained by the factors. The following chart shows R-squared distributions for different factor types, so you can compare how much of the return variance each factor type explains.

Figure 6 R-squared distributions for factor types

The next chart also takes beta into consideration, which measures a stock’s sensitivity to factor movements, and displays R-squared as bubble size to represent each factor’s explanatory power.

Figure 7 Visualize Beta, T-Statistic, and R-squared results

The following graph shows that increasing the number of threads for AWS Batch significantly reduces processing time. For the LLM-based market news sentiment factor, for example, one thread takes 86 minutes, three threads reduce the duration to 64 minutes, and six threads reduce it further to 48 minutes, demonstrating improved efficiency with higher concurrency.

Figure 8 Step Functions processing time vs AWS Batch parallelization

Deploying the framework

The project is available on GitHub with Terraform code for automated deployment, simplifying the setup process on individual AWS accounts. By following the provided README and deployment guidance, users can quickly implement this factor mining framework for their selected stock pool, such as the DJIA 30 stocks or S&P 500 stocks. This framework leverages AWS scalable infrastructure and advanced GenAI capabilities.

Prerequisites

Before running this code, please ensure you have the deployment dependencies installed on your system: at a minimum, Terraform and AWS credentials configured for your target account (see the GitHub README for the full list). These tools are required for successful deployment and execution of the application.

Deployment guidance

The framework on GitHub mainly consists of the docs, src, and terraform folders: docs stores the README.md images, src stores all the Python code, and terraform contains the following folders:

  • 0-prepare/: Shared resources including Lambda layers
  • 1-networking/: Base networking infrastructure (VPC, subnets, etc.)
  • 2-clickhouse/: ClickHouse database deployment
  • 3-jump-host/: Jump host for accessing ClickHouse in the private subnet
  • 4-market-data/: Market data collection infrastructure
  • 5-web-search/: Web search data collection infrastructure
  • 6-sec-filing/: SEC filing data collection infrastructure
  • 7-financial-report/: Financial report processing infrastructure
  • 8-factor-mining/: Factor mining processing infrastructure
  • 9-visualization/: Visualization infrastructure
  • modules/: Reusable Terraform modules

Before deployment, update the default values in terraform.tfvars in the following folders:

1. In the file ./terraform/0-prepare/terraform.tfvars, update the following variables with unique identifiers. For example, you can use your AWS account ID combined with a timestamp to ensure uniqueness (e.g., “123456789012-20250615”). This prevents resource naming conflicts across AWS accounts:

lambda_artifacts_bucket_name = "factor-mining-lambda-artifacts-ACCOUNTID-YYYYMMDD"
terraform_state_bucket_name = "factor-mining-terraform-state-ACCOUNTID-YYYYMMDD"
code_signing_profile_name = "code-signing-profile-ACCOUNTID-YYYYMMDD"

2. In the file ./terraform/5-web-search/terraform.tfvars, update the tavily_api_key variable with your personal Tavily API key:

tavily_api_key = "YOUR_OWN_TAVILY_API_KEY"

3. In the file ./terraform/6-sec-filing/terraform.tfvars, update the email variable with your personal email address:

email = "your.email@example.com"

4. In the file ./terraform/7-financial-report/terraform.tfvars, update financial_reports_bucket_name with a unique name and, if you want to run the analysis with a specific model, bedrock_model_id:

financial_reports_bucket_name = "financial-reports-bucket-ACCOUNTID-YYYYMMDD"

# Amazon Bedrock configuration
bedrock_model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

After updating the default values, you can simply deploy all modules by entering the following command:

cd terraform
./deployAll.sh

Or you can deploy the modules one by one. The following commands install the necessary dependencies, bootstrap the environment, and deploy a single module using the Terraform code, for example:

# Navigate to the module folder you want to deploy
cd ./terraform/1-networking

# Deploy the module using the terraform code
./deploy.sh

Note: When deploying modules independently, please adhere to the numerical sequence indicated in each module’s folder name:

  1. Begin with 0-prepare before any other modules
  2. Deploy 1-networking to establish your AWS network environment
  3. Deploy 2-clickhouse to build the ClickHouse database for storing all factor data in the AWS network environment
  4. (optional) Deploy 3-jump-host to create a secure jump host for connecting to your ClickHouse database for troubleshooting
  5. For data source modules 4-market-data, 5-web-search, 6-sec-filing, and 7-financial-report, deploy them only if you need them based on your requirements.
  6. Deploy 8-factor-mining to perform factor construction and calculations
  7. Deploy 9-visualization to display factor mining results through interactive dashboards

Cleaning up

After evaluating the framework, to avoid unnecessary charges, navigate to your deployment folder and run:

terraform destroy

Or run the provided cleanup script:

cd terraform
./destroyAll.sh

There are more details about cleaning up in the cleanup section in the GitHub deployment guide.

Benefits and conclusion

The cloud-based factor modeling solution we presented offers several benefits for traders and quantitative analysts, including:

  • Auto-scaling capabilities provide virtually unlimited storage and compute resources, facilitating comprehensive factor mining across vast datasets without infrastructure constraints. This enables parallelized back-testing, feature extraction, and model training at speeds and scales impractical with traditional on-premises setups.
  • Generative AI on AWS efficiently parses unstructured data, incorporating valuable insights from sources like market news and financial report summaries into factor models, potentially uncovering new sources of alpha.
  • The provided GitHub sample offers an easy-to-deploy solution, empowering quants to quickly implement this architecture on AWS and start benefiting from advanced factor modeling techniques.

Ready to revolutionize your factor modeling approach? Visit our GitHub repository to access the sample code and deployment instructions. For personalized guidance on implementing this solution for your specific needs, contact your AWS account team. Don’t let traditional infrastructure limitations hold back your quantitative edge – harness the power of cloud-based factor modeling now!

Jacky Wu

Jacky Wu is a senior FSI Solutions Architect at AWS with almost 20 years of experience in technology and capital markets. He has developed automated trading and high-frequency trading systems for market making, equity long-short, and statistical arbitrage strategies. Jacky holds the Financial Risk Manager (FRM) certification. Outside of work, Jacky maintains an active lifestyle through regular 10km runs.

Baoliang Cui

Baoliang Cui is an FSI solutions architect at AWS with 15 years of software development experience and 5 years as an architect. He has been involved in two cloud computing startups and currently serves leading securities firms in FSI. He is passionate about traditional Chinese wellness practices and tea ceremonies.

Ken Cho

Ken Cho is an FSI Solutions Architect at AWS in Hong Kong, focusing on payment, lending, and hedge fund customers. With over a decade of networking experience and a CISSP certification, Ken specializes in secure cloud migrations and Amazon EKS implementations. Ken helps financial institutions achieve greater agility and innovation through cloud-native transformations, realizing the full benefits of cloud computing.

Melody Lin

Melody Lin is an FSI Solutions Architect at AWS with over 10 years of experience in the IT and financial services industry. Prior to joining AWS, she served as a developer and cloud architect, specializing in data analytics and partner solutions enablement. Her expertise helps financial institutions leverage cloud technologies to drive innovation and achieve business outcomes.