Tracing has revealed hybrid bottlenecks and delivers full visibility into critical payment flows
What is our primary use case?
My main use case for Apache SkyWalking is a project that started in 2023 for a retail client facing serious performance issues on their new distributed architecture on AWS. The technical stakes were clear: we had an observability black hole on a high-traffic payment flow, and we could not tell whether latency came from the microservices on Amazon EKS or from calls to legacy on-premises databases. We chose Apache SkyWalking through the AWS Marketplace so we could integrate it immediately into the existing infrastructure and monitor a large environment of over 80 microservices and about 600 active pods. The solution lets us manage and analyze on the order of 50 million traces per day, correlating every end-to-end transaction in real time from front end to database and pinpointing bottlenecks that are invisible to traditional logging systems.
How has it helped my organization?
By analyzing the traces provided by Apache SkyWalking, we identified a high volume of redundant and inefficient calls between microservices, in particular synchronous calls that could be converted to asynchronous processes. Without this insight, we were over-provisioning our AWS resources just to keep the system stable. Once we applied the optimizations suggested by the trace data, the system became significantly more efficient. We were able to process 30% more transactions per second on the same infrastructure, which improved the end-to-end user experience during high-traffic sales peaks and also optimized the client's cloud spend by increasing the overall density and performance of our existing 600 pods.
What is most valuable?
First, Apache SkyWalking gave us full visibility into the black hole. Before using it, we could not see what happened when a request left Amazon EKS and went to our on-premises legacy databases. Apache SkyWalking's distributed tracing correlates these two worlds in a single view, and it showed us that 40% of the latency was actually occurring in the network hop between the cloud and the physical data center, not in the code itself.
Second, it exposes hidden architectural flaws. Using the automatic dependency mapping, we discovered that some microservices were stuck in a cyclic dependency that was not documented anywhere. This visual evidence allowed us to refactor the logic, which immediately increased our throughput by 30%.
Third, Apache SkyWalking gave us database-level insight without database access. Through its slow query monitoring, the Java agents captured the exact SQL statements that were hanging during peak sales hours. Our developers could fix the exact line of code or index without waiting for a DBA to pull logs, which reduced our mean time to resolution.
Several features are worth mentioning here because each brought a different benefit. Apache SkyWalking automatically drew the topology of the 600 pods, which is how we discovered cyclic dependencies between services that no one had documented and that were slowing down the system. Resolving hybrid bottlenecks is another valuable capability: we isolated a specific network issue between AWS and the physical data center. Without distributed tracing, infrastructure teams blame the Java code and vice versa. Database tuning is also important: thanks to the slow query metrics captured by the agent, we identified and rewrote the SQL queries that most impacted performance during sales peaks.
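To illustrate the kind of cyclic dependency the topology map made visible, here is a minimal sketch of cycle detection over a service call graph using a depth-first search. This is not SkyWalking's own API or algorithm; the `find_cycle` function and the service names are hypothetical, and the graph is assumed to have been exported by hand as a mapping from caller to callees.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for callee in graph.get(node, []):
            state = color.get(callee, WHITE)
            if state == GRAY:
                # Back edge: the callee is already on the current path.
                return path[path.index(callee):] + [callee]
            if state == WHITE:
                found = visit(callee)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Hypothetical call graph: caller -> list of services it calls.
calls = {
    "checkout": ["payment"],
    "payment": ["inventory", "ledger"],
    "inventory": ["pricing"],
    "pricing": ["payment"],
    "ledger": [],
}
cycle = find_cycle(calls)  # ['payment', 'inventory', 'pricing', 'payment']
```

In practice the topology view does this visually; a script like this is only useful as a regression check once the cycle has been refactored away.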
What needs improvement?
Apache SkyWalking could be improved in storage management. At a volume of 50 million traces a day, managing data retention on OpenSearch is critical, and we had to implement custom logic to purge old data and keep storage costs from exploding. Another area for improvement is alert configuration: the out-of-the-box alerting system is basic, and we had to manually write complex rules to avoid false positives, an activity that requires expert time.
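As a rough idea of what such purge logic involves, the sketch below selects the daily indices that have fallen out of a retention window. It is an assumption-laden illustration, not our production code: the `expired_indices` function is hypothetical, the `sw_segment-YYYYMMDD` naming is an assumed convention for day-rolled indices, and the actual deletion call against OpenSearch is deliberately left out.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_indices(index_names: Iterable[str], today: date, retention_days: int) -> List[str]:
    """Return the daily indices whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        suffix = name.rsplit("-", 1)[-1]
        if len(suffix) != 8 or not suffix.isdigit():
            continue  # not a dated index; never purge what we cannot parse
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits but not a valid calendar date
        if day < cutoff:
            expired.append(name)
    return expired


# Hypothetical index names (assumed naming convention).
names = [
    "sw_segment-20240101",
    "sw_segment-20240105",
    "sw_segment-20240110",
    "sw_metrics-all",
]
to_delete = expired_indices(names, today=date(2024, 1, 12), retention_days=7)
# to_delete == ['sw_segment-20240101']
```

Keeping the selection as a pure function makes it easy to dry-run against a list of index names before anything is actually deleted.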
For how long have I used the solution?
Apache SkyWalking is the first solution for this purpose that I have used in my professional experience.
What do I think about the stability of the solution?
Apache SkyWalking is really stable for us.
What do I think about the scalability of the solution?
Handling 50 million traces a day is not trivial. The Apache SkyWalking architecture, backed by a well-sized OpenSearch cluster on AWS, handled the load without any data loss.
Apache SkyWalking is really scalable, and its scalability breaks down into two main areas: horizontal scalability of the backend and storage scalability with Amazon OpenSearch. The real beauty of the system is its non-blocking nature. The agents use a sampling strategy and asynchronous reporting, so even if the backend is under heavy load, it never slows down the actual retail application. It is built for high-concurrency environments, which gives us the confidence to monitor 600 pods simultaneously without fearing a system-wide collapse.
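The general pattern behind sampling plus asynchronous reporting can be sketched as a bounded queue that drops data instead of making the application wait. This is an illustration of the idea only, not the SkyWalking agent's implementation; the `AsyncReporter` class, its parameters, and the `send` callback are all hypothetical.

```python
import queue
import threading
from typing import Callable, List


class AsyncReporter:
    """Buffers sampled traces in a bounded queue; the request path never waits on the backend."""

    def __init__(self, send: Callable[[List[dict]], None], capacity: int = 1000, sample_every: int = 1):
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=capacity)
        self._send = send                  # hypothetical callback that ships a batch to the backend
        self._sample_every = sample_every  # keep 1 trace out of every N
        self._seen = 0
        self.dropped = 0
        self._lock = threading.Lock()

    def report(self, trace: dict) -> bool:
        """Called on the request path. Returns True if the trace was buffered."""
        with self._lock:
            self._seen += 1
            if self._seen % self._sample_every != 0:
                return False  # not sampled
        try:
            self._queue.put_nowait(trace)  # never waits for free space
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1  # backend is behind: drop rather than slow the app
            return False

    def flush(self, max_batch: int = 100) -> int:
        """Called from a background thread or timer. Sends up to max_batch traces."""
        batch: List[dict] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._send(batch)
        return len(batch)


sent: List[dict] = []
reporter = AsyncReporter(send=sent.extend, capacity=2)
results = [reporter.report({"id": i}) for i in (1, 2, 3)]  # third trace is dropped
flushed = reporter.flush()
```

The trade-off is explicit: under backend pressure some traces are lost, but the payment flow itself is never held up by monitoring.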
What was our ROI?
The biggest impact comes from the features that let us stop the blame game between the network, cloud, and database teams. By looking at the colored lines on the topology map, which turn red when latency exceeds our threshold, we can instantly see where the bottleneck is located. Locating it used to be a four-hour investigation and is now a five-minute visual check, and that is the key factor in improving our MTTR and increasing the system's overall throughput by 20-30%.
The mean time to resolution, the time to diagnose a critical incident, dropped from four hours to less than one hour and 45 minutes. Coverage also increased: we went from siloed visibility to 100% tracking of critical payment transactions. These are our success metrics.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been good, but I am not in charge of this area; the customer handles the purchasing of the tool.
What other advice do I have?
Apache SkyWalking is an exceptional tool for managing high volumes and complex architectures on AWS without the prohibitive cost of commercial suites. I would give this product a rating of 9 out of 10.