Helps debug slow performance, with good support and a straightforward setup
What is our primary use case?
We use Datadog for monitoring the performance of our infrastructure across multiple types of hosts in multiple environments. We also use APM to monitor our applications in production.
We have some Kubernetes clusters and multi-cloud hosts with Datadog agents installed. We have recently added RUM to monitor our application from the user side, including session replays, and are hoping to use it to replace our existing error monitoring and session replay tooling for debugging issues in the application.
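For the application side of this, the APM piece comes down to instrumenting each service. As a rough sketch only (assuming a Python service purely for illustration; the service and resource names are made up and not our actual configuration):

```python
# Minimal APM instrumentation sketch using ddtrace (Python).
# Assumes a Datadog agent is reachable from the host or pod (e.g. via DD_AGENT_HOST).
from ddtrace import patch_all, tracer

patch_all()  # auto-instrument supported libraries (web framework, DB drivers, etc.)

@tracer.wrap(service="checkout", resource="generate_invoice")  # hypothetical names
def generate_invoice(order_id: int) -> None:
    # Anything inside shows up as spans under the "checkout" service in APM.
    with tracer.trace("invoice.render") as span:
        span.set_tag("order_id", order_id)
        # ... render and persist the invoice ...
```

In Kubernetes, the same effect is usually achieved by launching the container's process under ddtrace-run and setting DD_SERVICE, DD_ENV, and DD_AGENT_HOST on the pod, rather than changing code.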
How has it helped my organization?
We have been using Datadog since I started working at the company ten years ago and it has been used for many reasons over the years. Datadog across our services has helped debug slow performance on specific parts of our application, which, in turn, allows us to provide a snappier and more performant application for our customers.
The monitoring and alerting system has allowed our team to be aware of issues that come up in our production system and to react faster, with more tools to debug and investigate, keeping the system online for our customers.
What is most valuable?
Datadog infrastructure monitoring has helped us identify health issues with our virtual machines, such as high load, CPU, and disk usage, as well as monitoring uptime and alerting when Kubernetes containers have trouble staying up. Our use of Datadog's Application Performance Monitoring (APM) over the last six years or so has been crucial to identifying performance issues and bottlenecks, as well as alerting us when services are seeing high error rates, which has made it easier to debug when specific services may be going down.
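As a rough illustration of the kind of alert described here, a metric monitor for high CPU can also be created through the Datadog API. This is only a sketch using the datadog Python client; the keys, thresholds, and tags are placeholders rather than our real configuration:

```python
# Sketch: create a high-CPU monitor via the Datadog API (datadogpy client).
# Keys, thresholds, and tags below are placeholders, not real values.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:system.cpu.user{env:production} by {host} > 90",
    name="High CPU on {{host.name}}",
    message="CPU has been above 90% for 5 minutes on {{host.name}}.",
    tags=["team:platform", "managed-by:script"],
    options={"thresholds": {"critical": 90, "warning": 80}, "notify_no_data": False},
)
```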
What needs improvement?
We have found that some of the options for filtering log ingestion, APM trace and span ingestion, and RUM session versus replay settings can be hard to discover within the console, and it is tough to determine how to adjust and tweak them for optimal performance and monitoring as well as for billing.
It can sometimes be difficult to determine which information in the documentation is current, as we have found inconsistencies with deprecated information, such as environment variables, in the docs.
For how long have I used the solution?
I've been using the solution for ten years.
What do I think about the stability of the solution?
The solution seems pretty stable, as we've been using it for more than a decade.
What do I think about the scalability of the solution?
The solution seems quite scalable, especially within Kubernetes. Costs are a factor.
How are customer service and support?
Support has been very helpful whenever we need it.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We had tried other APM monitoring in the past; however, it was too expensive. We then added APM within Datadog since we were already using Datadog and it seemed like a good value add.
How was the initial setup?
The solution is straightforward to set up. Sometimes, it is complex to find the correct documentation.
What about the implementation team?
We handled the setup in-house.
What was our ROI?
Our ROI is ease of mind with alerts and monitoring, as well as the ability to review and debug issues for our customers.
What's my experience with pricing, setup cost, and licensing?
Getting settled on pricing is something you want to keep an eye on, as things seem to change regularly.
Which other solutions did I evaluate?
We used New Relic previously.
What other advice do I have?
Datadog is a great service that is continually growing its solution for monitoring and security. It is easy to set up, and its features are easy to turn on and off once you have instrumented agents and tailored the solution to your needs.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
Easy to configure with synthetic testing and offers a consolidated approach to monitoring
What is our primary use case?
We use this solution for enterprise monitoring across a large number of applications in multiple environments like production, development, and testing. It helps us track application performance, uptime, and resource usage in real time, providing alerts for issues like downtime or performance bottlenecks.
Our hybrid environment includes cloud and on-premise infrastructure. The solution is crucial for ensuring reliability, compliance, and high availability across our diverse application landscape.
How has it helped my organization?
Datadog has greatly improved our organization by centralizing all monitoring into one platform, allowing us to consolidate data from a wide range of sources.
From infrastructure metrics and application logs to end-user experience and device monitoring, everything is now collected and displayed in one place. This has simplified our monitoring processes, improved visibility, and allowed for faster issue detection and resolution.
By streamlining these operations, Datadog has enhanced both efficiency and collaboration across teams.
What is most valuable?
Synthetic testing is by far the most valuable feature in our organization. It’s highly requested since the setup process is both quick and straightforward, allowing us to simulate user interactions across our applications with minimal effort.
The ease of configuring tests and interpreting the results makes it accessible even to non-technical team members. This feature provides valuable insights into user experience, helps identify performance bottlenecks, and ensures that our critical workflows are functioning as expected, enhancing reliability and uptime.
What needs improvement?
One area where the product could be improved is Application Performance Monitoring (APM). While it's a powerful feature, many in our organization find it difficult to fully understand and utilize to its maximum potential.
The data provided is comprehensive, yet it can sometimes be overwhelming, especially for those who are less familiar with the intricacies of application performance metrics.
Simplifying the interface, offering clearer guidance, or providing more intuitive visualizations would make it easier for users to extract valuable insights quickly and efficiently.
For how long have I used the solution?
I've used the solution for four years.
What do I think about the stability of the solution?
The solution is very stable. Issues happen once or twice a year and are usually solved before we have any real impact on the service.
What do I think about the scalability of the solution?
Scalability has never been a bottleneck for us; we've never felt any issues here.
How are customer service and support?
Support was slow at the beginning; however, they are much more responsive now.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
Datadog offered the most consolidated approach to our monitoring needs.
How was the initial setup?
This was a migration project, so it was rather complex.
What about the implementation team?
We implemented the solution with our in-house team.
What's my experience with pricing, setup cost, and licensing?
I'd recommend new users look down the road and decide on at least a three-year plan.
Which other solutions did I evaluate?
Improved response time and cost-efficiency with good monitoring
What is our primary use case?
We monitor our multiple platforms using Datadog and post alerts to Slack to notify us of server and end-user issues. We also monitor user sessions to help troubleshoot an issue being reported.
We monitor 3.5 platforms on our Datadog instance, and the team regularly monitors the trends and dashboards we set up. We have two instances to span the 3.5 platforms and are currently looking to implement more platform monitoring over time. The user session monitoring is consistent for one of these platforms.
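The Slack notifications come from the Slack integration's @-handles placed in monitor messages. A hedged sketch of that pattern follows; the channel name, query, and thresholds are illustrative, not our real setup:

```python
# Sketch: a monitor whose notifications post to Slack via the Datadog Slack integration.
# The @slack- handle only resolves once the Slack integration is installed; the channel
# and query below are examples.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="avg(last_10m):avg:system.load.5{env:prod} by {host} > 4",
    name="High load average on {{host.name}}",
    message="Load average above 4 on {{host.name}}. @slack-ops-alerts",
)
```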
How has it helped my organization?
Datadog has improved our response time and cost-efficiency in bug reporting and server maintenance. We're able to track our servers more fluidly, allowing us to expand our outreach and decrease response time.
There are many different ways that Datadog is used, and we monitor three and a half platforms on the Datadog environment at this time. By monitoring all of these platforms in one easy-to-use instance, we're able to track the platform with the issue, the issue itself, and its impact on the end user.
What is most valuable?
The server monitoring, service monitoring, and user session monitoring are extremely helpful, as they allow us to be alerted ahead of time about issues that users might experience. More often than not, an issue is not only identified but also fixed and released before an end user notices it.
We are currently using this as an investigative tool to notice trends, identify issues, and locate areas of our program that we can improve upon that haven't been identified as pain points yet. This is another effective use case.
What needs improvement?
I would like to see a longer retention time for user sessions, even if only by 24 to 48 hours, or even just having it be configurable. That would let us keep user sessions for issues that have gone unnoticed for a long time and identify problems that people are working around.
I would also like to see an improvement in the server's data extraction times, as sometimes it can take up to ten minutes to download a report for a critical issue that is costing us money. Regardless, I am very happy with Datadog and love the uses we have for the program so far.
For how long have I used the solution?
I've used the solution for more than four years.
Which solution did I use previously and why did I switch?
We did not previously use a different solution.
Improves monitoring and observability with actionable alerts
What is our primary use case?
We are using Datadog to improve our monitoring and observability so we can hopefully improve our customer experience and reliability.
I have been using Datadog to build better, more actionable alerts to help teams across the enterprise. By using Datadog, we are also hoping to improve observability into our apps, and we are taking advantage of this process to improve our tagging strategy so teams can troubleshoot incidents faster and achieve a much lower mean time to resolution.
We use a lot of different resources, such as Kubernetes, App Gateway, and Cosmos DB, just to name a few.
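On the tagging-strategy side, the pattern this usually converges on is attaching the same env/service/version-style tags everywhere telemetry is emitted. A rough Python sketch, with example tag values only:

```python
# Sketch: applying a consistent tag set to traces and custom metrics (values are examples).
from ddtrace import tracer
from datadog.dogstatsd import DogStatsd

# Tags applied to every span this tracer emits.
tracer.set_tags({"env": "prod", "team": "payments", "version": "1.4.2"})

# A DogStatsD client that stamps the same tags on every custom metric it sends.
statsd = DogStatsd(
    host="127.0.0.1",
    port=8125,
    constant_tags=["env:prod", "team:payments", "version:1.4.2"],
)
statsd.increment("payments.requests.count", tags=["endpoint:/charge"])
```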
How has it helped my organization?
As soon as we started implementing Datadog in our cloud environment, people really liked how it looked and how easy it was to navigate. We could see more data in our Kubernetes environments than we ever could before.
Some people liked how the logs were color coded so it was easy to see what kind of log you were looking at. The ease of making dashboards has also been very well received as a benefit.
People have commented that there is so much information that it takes time to digest it, get used to what you are looking at, and find what you are looking for.
What is most valuable?
The selection of monitors is a big feature I have been working with. Previously, with Azure Monitor, we couldn't do a whole lot with its alerts. Its log alerts can sometimes take a while to ingest, and we couldn't do any math with the metrics we extracted from logs to build better alerts.
Its metric alerts are OK but still very limited. With Datadog, we can make a wide range of different monitors that we can tweak in real time, because there is a graph of the data as you are creating the alert, which is very beneficial. The ease of making dashboards has saved a lot of people a lot of time. There are no KQL queries to put together the information you are looking for, and the ability to pin any info you see into a dashboard is very convenient.
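For instance, the kind of "math on metrics" that was hard in Azure Monitor is just an arithmetic expression inside a Datadog monitor query. The service and metric names below are hypothetical, shown only to illustrate the shape of such queries:

```python
# Sketch: monitor queries that combine metrics arithmetically (names are hypothetical).

# Error rate as a ratio of two APM metrics, alerting above 5%.
error_rate_query = (
    "sum(last_5m):"
    "sum:trace.flask.request.errors{service:checkout}.as_count() / "
    "sum:trace.flask.request.hits{service:checkout}.as_count() > 0.05"
)

# A percentage derived from two infrastructure metrics.
disk_usage_query = (
    "avg(last_15m):"
    "avg:system.disk.used{env:prod} by {host} / "
    "avg:system.disk.total{env:prod} by {host} * 100 > 85"
)

# Either string can be passed as the `query` argument when creating a monitor
# of type "query alert" via the API.
```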
RUM is another feature we are looking forward to using this upcoming tax season, as we will have a front-row view into what frustrates customers or where things go wrong in their process of using our site.
What needs improvement?
The PagerDuty integration could be a little bit better. If there were a way to format the monitors for different incident management software, that would be awesome. As of right now, it takes a lot of manipulating of PagerDuty to get the monitors from Datadog to populate all the fields we want in PagerDuty.
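For context, the usual way a monitor reaches PagerDuty today is through an @-handle and template variables in the monitor message. A hedged sketch of that shape; the PagerDuty service name and runbook link are made up:

```python
# Sketch: a monitor message that routes to PagerDuty and fills in context via template
# variables. The @pagerduty-<Service> handle only exists once that service is configured
# in the PagerDuty integration; names below are made up.
monitor_message = """\
{{#is_alert}}
High error rate on {{service.name}} (value: {{value}}).
Runbook: https://wiki.example.com/runbooks/checkout
{{/is_alert}}
{{#is_recovery}}
Error rate on {{service.name}} has recovered.
{{/is_recovery}}
@pagerduty-Checkout-Service
"""
# This string goes in the `message` argument of api.Monitor.create();
# {{service.name}} only resolves when the monitor is grouped by the `service` tag.
```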
I love the fact you can query data without using something like KQL. However, it would also be helpful if there was a way to convert a complex KQL query into Datadog to be able to retrieve the same data - especially for very specific scenarios that some app teams may want to look for.
For how long have I used the solution?
I've used the solution for about two years.
Which solution did I use previously and why did I switch?
We previously used Azure Monitor, App Insights, and Log Analytics. We switched because it was a lot for developers and SREs to switch between three screens to troubleshoot, and when you add in the slow load times from Azure, it can take a while to get things done.
What's my experience with pricing, setup cost, and licensing?
I would advise taking a close look at logging costs, man-hours needed, and the amount of time it takes for people to get comfortable navigating Datadog because there is so much information that it can be overwhelming to narrow down what you need.
Which other solutions did I evaluate?
We did evaluate Dynatrace and looked into New Relic before settling on Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
Centralized pipeline with synthetic testing and a customized dashboard
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMWare servers to support higher education web applications. Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge.
Datadog agents on each web host, plus native integrations with GitHub, AWS, and Azure, get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is all obtainable and tunable. Both legacy .NET Framework with Windows Event Viewer and cutting-edge .NET Core with streaming logs work. The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
Centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most.
The ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
These features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go, yet it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed.
In some cases, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
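For anyone hitting the same wall, the general shape of log-to-trace correlation is to enable log injection so trace and span IDs land in each log record. The specifics we struggled with were .NET/IIS, but the idea is the same across runtimes; here is a rough Python sketch with an illustrative format string:

```python
# Sketch: correlating logs with traces via ddtrace's log injection (Python shown here;
# the same concept applies to other runtimes). Logger name and format are illustrative.
import logging

from ddtrace import patch

patch(logging=True)  # injects dd.trace_id / dd.span_id attributes into log records

FORMAT = ("%(asctime)s %(levelname)s [dd.trace_id=%(dd.trace_id)s "
          "dd.span_id=%(dd.span_id)s] %(message)s")
logging.basicConfig(format=FORMAT)
log = logging.getLogger("webapp")
log.error("payment failed")  # this line is now linkable to its APM trace in Datadog
```

Running the app under ddtrace-run with DD_LOGS_INJECTION=true achieves the same thing without code changes.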
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The solution has been very scalable and customizable.
How are customer service and support?
Sales service is always helpful in tuning our committed costs and alerting us when we start spending outside the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility regardless of whether an app runs on Linux, Windows, or containers, hosted in the cloud or on-prem.
How was the initial setup?
The setup was generally simple; however, .NET profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
I'd count our ROI as significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
I'm excited to dig further into the new offerings around LLMs and to continue growing our footprint in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Good monitoring capabilities, centralized logs, and easily searchable data
What is our primary use case?
Our primary use of Datadog involves monitoring over 50 microservices deployed across three distinct environments. These services vary widely in their functions and resource requirements.
We rely on Datadog to track usage metrics, gather logs, and provide insight into service performance and health. Its flexibility allows us to efficiently monitor both production and development environments, ensuring quick detection and response to any anomalies.
We also have better insight into metrics like latency and memory usage.
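Where the built-in integrations don't cover a metric (latency of an internal step, queue depth, memory of a specific worker), custom metrics can be pushed through DogStatsD. A minimal sketch with hypothetical metric names:

```python
# Sketch: submitting custom service metrics via DogStatsD (metric names are hypothetical).
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

statsd.gauge("orders.queue.depth", 42, tags=["service:orders", "env:prod"])
statsd.histogram("orders.processing_ms", 137.0, tags=["service:orders", "env:prod"])

@statsd.timed("orders.fulfillment_ms", tags=["service:orders"])
def fulfill_order(order_id: int) -> None:
    ...  # the decorator times this call and submits the duration as a metric
```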
How has it helped my organization?
Datadog has significantly improved our organization’s monitoring capabilities by centralizing all of our logs and making them easily searchable. This has streamlined our troubleshooting process, allowing for quicker root cause analysis.
Additionally, its ease of implementation meant that we could cover all of our services comprehensively, ensuring that logs and metrics were thoroughly captured across our entire ecosystem. This has enhanced our ability to maintain system reliability and performance.
What is most valuable?
The intuitive user interface has been one of the most valuable features for us. Unlike other platforms such as Grafana, where learning how to query involves either a lot of trial and error or memorization, almost like learning a new language, Datadog's UI makes finding logs, metrics, and performance data straightforward and efficient. This ease of use has saved us time and reduced the learning curve for new team members, allowing us to focus more on analysis and troubleshooting rather than on learning the tool itself.
What needs improvement?
While the UI and search functionality are excellent, further improvement could be made in the querying of logs by offering more advanced templates or suggestions based on common use cases. This would help users discover powerful queries they might not think to create themselves.
Additionally, enhancing alerting capabilities with more customizable thresholds or automated recommendations could provide better insights, especially when dealing with complex environments like ours with numerous microservices.
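On the alerting side, one concrete pattern that works well for a log-heavy microservice setup like ours is a log monitor with explicit thresholds. This is only a sketch; the service name, query, and numbers are examples:

```python
# Sketch: a log-based monitor with explicit thresholds (service and numbers are examples).
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="log alert",
    query='logs("service:checkout status:error").index("*").rollup("count").last("5m") > 100',
    name="Error log spike on checkout",
    message="More than 100 error logs in 5 minutes for checkout. @slack-oncall",
    options={"thresholds": {"critical": 100, "warning": 50}},
)
```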
For how long have I used the solution?
I've used the solution for five years.
What do I think about the stability of the solution?
We have never experienced any downtime.
Which solution did I use previously and why did I switch?
We previously used Sumo Logic.
Which deployment model are you using for this solution?
Public Cloud
Excellent for monitoring, analyzing, and optimizing performance
What is our primary use case?
Our primary use case for Datadog is monitoring, analyzing, and optimizing the performance and health of our applications and infrastructure.
We leverage its logging, metrics, and tracing capabilities to pinpoint issues, track system performance, and improve overall reliability. Datadog’s ability to provide real-time insights and alerting on key metrics helps us quickly address issues, ensuring smooth operations.
It’s integral for visibility across our microservices architecture and cloud environments.
How has it helped my organization?
Datadog has been incredibly valuable to our organization. Its ability to pinpoint warnings and errors in logs and provide detailed context is essential for troubleshooting.
The platform's request tracing feature offers comprehensive insights into user flows, allowing us to quickly identify issues and optimize performance.
Additionally, Datadog's real-time monitoring and alerting capabilities help us proactively manage system health, ensuring operational efficiency across our applications and infrastructure.
What is most valuable?
Being able to filter requests by latency is invaluable, as it provides immediate insight into which endpoints require further analysis and optimization. This feature helps us quickly identify performance bottlenecks and prioritize improvements.
Additionally, the ability to filter requests by user email is extremely useful for tracking down user-specific issues faster. It streamlines the troubleshooting process and enables us to provide more targeted support to individual users, improving overall customer satisfaction.
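Filtering by user email works because the traces carry user attributes. A rough sketch of how a request handler might attach them, written in Python with ddtrace; the handler is hypothetical, and the usr.email tag name follows Datadog's user-attribute convention:

```python
# Sketch: tagging the active trace with the requesting user's email so requests can be
# filtered by user in APM. The handler and its wiring are illustrative.
from ddtrace import tracer

def handle_request(user_email: str) -> None:
    span = tracer.current_root_span()
    if span is not None:
        span.set_tag("usr.email", user_email)  # Datadog's user-attribute convention
    # ... handle the request ...
```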
What needs improvement?
The query performance could be improved, particularly when handling large datasets, as slower response times can hinder efficiency. Additionally, the interface can sometimes feel overwhelming, with so much happening at once, which may discourage users from exploring new features. Simplifying the layout or providing clearer guidance could enhance user experience. Any improvements related to query optimization would be highly beneficial, as it would further streamline workflows and boost productivity.
For how long have I used the solution?
I've used the solution for five years.
Easy to use with good speed and helpful dashboards
What is our primary use case?
We are using Datadog to improve our cloud monitoring and observability across our enterprise apps. We have integrated a lot of different resources into Datadog, like Kubernetes, App Gateways, App Service Environments, App Service Plans, and other Web App resources.
I will be using the monitoring and observability features of Datadog. Dashboards are used very heavily by teams and SREs. We really have seen that Datadog has already improved both our monitoring and our observability.
How has it helped my organization?
The ease and speed with which you can create a dashboard have been a huge improvement.
The different types of monitors we can create have been huge, too. We can do so many different things with monitors that we couldn't do before with our alerts.
Being able to click on a trace or log and drill down on it to see what happened has been great.
Some have found the learning curve a bit steep. That said, they are coming around slowly. There is just a lot of information to learn to navigate.
What is most valuable?
The different types of monitors have been very valuable. We have been able to make our alerts (monitors) more actionable than we were able to previously.
Watchdog is a favorite feature among a lot of the devs. It catches things they didn't even know were an issue.
RUM is another feature a lot of us are looking forward to; we want to see how it can help us improve our customer experience during tax season.
We hope to enable the code review feature at some point so we can see what code caused an issue.
What needs improvement?
I would like to see the integration between PagerDuty and Datadog improved. The tags in Datadog don't match those in PagerDuty, and we have to do extra work to make them line up. I would also like the ability to replicate a KQL query in Datadog to be made easier or better.
I would like to see the alert communications to email or phones made better so we could hopefully move off PagerDuty and just use Datadog for that.
There are also a lot of features that we haven't budgeted for yet and I would like for us to be able to use them in the future.
For how long have I used the solution?
I've used the solution for about two years.
Which deployment model are you using for this solution?
Hybrid Cloud
Excellent APM, RUM and dashboards
What is our primary use case?
We use the solution for APM, anomaly detection, resource metrics, RUM, and synthetics.
We use it to build baseline metrics for our apps before we start focusing in on performance improvements. A lot of times that’s looking at methods that take too long to run and diving into db queries and parsing.
I've used it in multiple configurations in AWS and Azure. I've built it using Terraform and by hand.
I’ve used it predominantly with Ruby and Node and a little bit of Python.
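A lot of that baseline work comes down to wrapping suspect methods so they show up as their own spans. Since Python is in the mix, here is a minimal sketch of that pattern; the service, resource, and span names are made up, and the same idea applies with the Ruby and Node tracers:

```python
# Sketch: wrapping a suspect method so it appears as its own span in APM (Python, ddtrace).
# Service/resource/span names are made up for illustration.
from ddtrace import tracer

@tracer.wrap(service="reports", resource="build_monthly_report")
def build_monthly_report(month: str) -> None:
    with tracer.trace("reports.db_query", span_type="sql") as span:
        span.set_tag("query.month", month)
        # ... run the slow query we want to see on the flame graph ...
    with tracer.trace("reports.parse"):
        pass  # ... parse results; the duration shows up as a child span ...
```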
How has it helped my organization?
The solution provides deep insights into our stack. It gives us the ability to measure and monitor before making decisions.
We're using it to make informed decisions about performance. Being able to show, across a timeline, how a release increased performance via a visual indication of p50+ metrics is almost magical.
Another way we use it is for leading indicators of issues that might be happening. For example, anomaly detection on gauge metrics across the app and synthetics built in with alerting configurations are both ways we can get alerted, sometimes even before a big issue happens.
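Those anomaly alerts are ordinary monitors whose query wraps a metric in the anomalies() function. A sketch of the query only, with a placeholder metric and parameters:

```python
# Sketch: an anomaly-detection monitor query (metric, algorithm, and bounds are placeholders).
anomaly_query = (
    "avg(last_4h):anomalies(avg:myapp.queue.depth{env:prod}, 'agile', 2) >= 1"
)
# Passed as the `query` when creating a monitor of type "query alert"; 'agile' is the
# seasonality-aware algorithm, and 2 is the width of the deviation bounds around the
# expected range.
```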
What is most valuable?
The most valuable aspects include APM, RUM and dashboards.
I think of Datadog as an analytics company first, and of the integrations around notifications and alerts as part of insight discoverability.
Everything Datadog offers, for me, is around knowledge building and how much I know about the deep details of my stack.
The pricing model makes more sense than what we paid for competing services. I was at one job where we used two competing services because DD didn't have a BAA for APM. Once it offered one, we immediately dumped the other solution for Datadog.
What needs improvement?
Logging is not a great experience. Searching for specific logs and then navigating around the context of the results is slow and cumbersome. Honestly, that is my only gripe with Datadog. It's a wonderful product outside of log searching. I have had a better experience using other services that aggregate logs for search.
My use case for it is around discoverability. Log search is fine if I'm just looking for something specific. That said, if it's something less targeted and I am wandering around looking for possible issues, it's really unintuitive.
For how long have I used the solution?
I've used the solution for more than eight years.
What do I think about the stability of the solution?
What about the implementation team?
We always implement the solution in-house.
Which deployment model are you using for this solution?
Private Cloud