Our company has a microservice architecture, with different teams in charge of different services. It is also a startup, which means we have to build and move very fast. Before we were properly using Datadog (DD), things often broke without much information on where in our system the failure happened. This was quite a big time sink: teams were unfamiliar with other teams' code, so they needed help from those teams to debug. This slowed our building down a lot. Implementing DD traces fixed this.

External reviews
External reviews are not included in the AWS star rating for the product.
Good alerting and issue detection for many valuable features
What is our primary use case?
What is most valuable?
Datadog has many features, but the most valuable have become our primary uses.
Also, given our frequent concurrent deployments, the Datadog alert monitors allow us to quickly detect issues if anything occurs.
What needs improvement?
The monitors can be improved. The chart in the monitors only goes back a couple of hours, which is clunky. The monitors could also provide more info, such as traces. We have many alerts connected to different notification systems, such as Slack and Opsgenie.
When the on-call engineer receives notifications fired by the alerts, they are taken to the monitors. Yet often, we have to open up many different tabs to see logs, traces, and other info that is not accessible on the monitors. I think it would make all of the on-call engineers' lives easier if the monitor had more data.
For how long have I used the solution?
We've used the solution for three years.
Unified platform with customizable dashboards and AI-driven insights
What is our primary use case?
Our primary use case for this solution is comprehensive cloud monitoring across our entire infrastructure and application stack.
We operate in a multi-cloud environment, utilizing services from AWS, Azure, and Google Cloud Platform.
Our applications are predominantly containerized and run on Kubernetes clusters. We have a microservices architecture with dozens of services communicating via REST APIs and message queues.
The solution helps us monitor the performance, availability, and resource utilization of our cloud resources, databases, application servers, and front-end applications.
It's essential for maintaining high availability, optimizing costs, and ensuring a smooth user experience for our global customer base. We particularly rely on it for real-time monitoring, alerting, and troubleshooting of production issues.
How has it helped my organization?
Datadog has significantly improved our organization by providing us with great visibility across the entire application stack. This enhanced observability has allowed us to detect and resolve issues faster, often before they impact our end-users.
The unified platform has streamlined our monitoring processes, replacing several disparate tools we previously used. This consolidation has improved team collaboration and reduced context-switching for our DevOps engineers.
The customizable dashboards have made it easier to share relevant metrics with different stakeholders, from developers to C-level executives. We've seen a marked decrease in our mean time to resolution (MTTR) for incidents, and the historical data has been invaluable for capacity planning and performance optimization.
Additionally, the AI-driven insights have helped us proactively identify potential issues and optimize our infrastructure costs.
What is most valuable?
We've found the Application Performance Monitoring (APM) feature to be the most valuable, as it provides great visibility on trace-level data. This granular insight allows us to pinpoint performance bottlenecks and optimize our code more effectively.
The distributed tracing capability has been particularly useful in our microservices environment, helping us understand the flow of requests across different services and identify latency issues.
Additionally, the log management and analytics features have greatly improved our ability to troubleshoot issues by correlating logs with metrics and traces.
The infrastructure monitoring capabilities, especially for our Kubernetes clusters, have helped us optimize resource allocation and reduce costs.
What needs improvement?
While Datadog is an excellent monitoring solution, it could be improved by building more features to replace alerting apps like Opsgenie and PagerDuty. Specifically, we'd like to see more advanced incident management capabilities integrated directly into the platform. This could include features like sophisticated on-call scheduling, escalation policies, and incident response workflows.
Additionally, we'd appreciate more customizable machine learning-driven anomaly detection to help us identify unusual patterns more accurately. Improved support for serverless architectures, particularly for monitoring and tracing AWS Lambda functions, would be beneficial.
Enhanced security monitoring and threat detection capabilities would also be valuable, potentially reducing our reliance on separate security information and event management (SIEM) tools.
For how long have I used the solution?
I've used the solution for two years.
Good dashboards, easy troubleshooting, and integrations
What is our primary use case?
We utilize Datadog mainly to monitor our API integrations and all of the inventory that comes in from our API partners. Each event has its own ID, so we can trace all activity related to each event and troubleshoot where needed.
How has it helped my organization?
Datadog gives non-dev teams insight into everything that is happening with a particular event, and it flags any errors so that we can troubleshoot more efficiently.
What is most valuable?
The dashboards are super convenient for us, giving a more zoomed-out view of what is going on with each integration that we utilize.
What needs improvement?
There could be more easily identifiable documentation on how to find different things on the platform. It can be overwhelming at first glance, and it's hard to find appropriate documentation on the site to lead you to where you need to be.
For how long have I used the solution?
I've used the solution for about 1.5 years.
Monitoring with Datadog
Becoming the Gold Standard
This is a good product, but is only just starting to bubble up observability. Takes minutes
APM is growing by leaps and bounds