Improved response time and cost-efficiency with good monitoring
What is our primary use case?
We monitor our multiple platforms using Datadog and post alerts to Slack to notify us of server and end-user issues. We also monitor user sessions to help troubleshoot an issue being reported.
We monitor 3.5 platforms on our Datadog instance, and the team always monitors the trends and Dashboards we set up. We have two instances to span the 3.5 platforms and are currently looking to implement more platform monitoring over time. The user session monitoring is consistent for one of these platforms.
How has it helped my organization?
Datadog has improved our response time and cost-efficiency in bug reporting and server maintenance. We're able to track our servers more fluidly, allowing us to expand our outreach and decrease response time.
There are many different ways that Datadog is used, and we monitor three and a half platforms on the Datadog environment at this time. By monitoring all of these platforms in one easy-to-use instance, we're able to track the platform with the issue, the issue itself, and its impact on the end user.
What is most valuable?
The server monitoring, service monitoring, and user session monitoring are extremely helpful, as they alert us ahead of time to issues that users might experience. More often than not, we can not only identify an issue but also fix it and release the fix before an end user notices.
We are currently using this as an investigative tool to notice trends, identify issues, and locate areas of our program that we can improve upon that haven't been identified as pain points yet. This is another effective use case.
What needs improvement?
I would like to see a longer retention time for user sessions, even if only by 24 to 48 hours, or at least a configurable retention option. This would let us keep user sessions that would otherwise go unseen for a long time and identify issues that people are working around.
I would also like to see an improvement in the server's data extraction times, as sometimes it can take up to ten minutes to download a report for a critical issue that is costing us money. Regardless, I am very happy with Datadog and love the uses we have for the program so far.
For how long have I used the solution?
I've used the solution for more than four years.
Which solution did I use previously and why did I switch?
We did not previously use a different solution.
Improves monitoring and observability with actionable alerts
What is our primary use case?
We are using Datadog to improve our monitoring and observability so we can hopefully improve our customer experience and reliability.
I have been using Datadog to build better actionable alerts to help teams across the enterprise. By using Datadog, we also hope to improve observability into our apps, and we are taking advantage of this process to improve our tagging strategy so teams can troubleshoot incidents faster and achieve a much lower mean time to resolve.
We have a lot of different resources we use like Kubernetes, App Gateway and Cosmos DB just to name a few.
How has it helped my organization?
As soon as we started implementing Datadog in our cloud environment, people really liked how it looked and how easy it was to navigate. We could see more data in our Kubernetes environments than we ever could before.
Some people liked how the logs were color coded so it was easy to see what kind of log you were looking at. The ease of making dashboards has also been greatly received as a benefit.
People have commented that there is so much information that it takes time to digest, get used to what you are looking at, and find what you are looking for.
What is most valuable?
The selection of monitors is a big feature I have been working with. Previously with Azure Monitor we couldn't do a whole lot with their alerts. The log alerts can sometimes take a while to ingest. Also, we couldn't do any math with the metrics we received from logs to make better alerts from logs.
The metric alerts are OK but are still very limited. With Datadog, we can make a wide range of different monitors that we can tweak in real time, because there is a graph of the data as you are creating the alert, which is very beneficial. The ease of making dashboards has saved a lot of people a lot of time. There are no KQL queries to put together to get the information you are looking for, and the ability to pin any info you see to a dashboard is very convenient.
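A minimal sketch of the kind of metric math a monitor can express, here an error-rate percentage built from two counts. The metric names `app.requests.errors` and `app.requests.total`, the service name, and the threshold are hypothetical placeholders, and the dictionary only mirrors the general shape of a Datadog monitor definition; it is not taken from the reviewer's setup.

```python
import json


def error_rate_monitor(service: str, env: str, critical_pct: float) -> dict:
    """Build a monitor definition that alerts on an error-rate percentage.

    The metric names below are hypothetical placeholders, not real metrics.
    """
    scope = f"{{service:{service},env:{env}}}"
    errors = f"sum:app.requests.errors{scope}.as_count()"
    total = f"sum:app.requests.total{scope}.as_count()"
    # Arithmetic between two series: errors / total * 100 gives a percentage.
    query = f"sum(last_5m):({errors} / {total}) * 100 > {critical_pct}"
    return {
        "name": f"[{env}] {service} error rate above {critical_pct}%",
        "type": "query alert",
        "query": query,
        "message": "Error rate is above the critical threshold.",
        "tags": [f"service:{service}", f"env:{env}"],
        # The critical threshold is kept in sync with the query.
        "options": {"thresholds": {"critical": critical_pct}},
    }


monitor = error_rate_monitor("checkout", "prod", 5.0)
print(json.dumps(monitor, indent=2))
```

Building the query from one function keeps the threshold in the query and in the options in sync, which is easy to get wrong when editing by hand.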
RUM is another feature we are looking forward to using this upcoming tax season, as we will have a front-row view into what frustrates customers or where things go wrong in their process of using our site.
What needs improvement?
The PagerDuty integration could be a little bit better. If there were a way to format monitors for different incident management tools, that would be awesome. As of right now, it takes a lot of manipulation in PagerDuty to get the monitors from Datadog to populate all the fields we want in PagerDuty.
I love the fact you can query data without using something like KQL. However, it would also be helpful if there was a way to convert a complex KQL query into Datadog to be able to retrieve the same data - especially for very specific scenarios that some app teams may want to look for.
For how long have I used the solution?
I've used the solution for about two years.
Which solution did I use previously and why did I switch?
We previously used Azure Monitor, App Insights, and Log Analytics. We switched because it was a lot for developers and SREs to switch between three screens to try to troubleshoot, and when you add in the slow load times from Azure, it can take a while to get things done.
What's my experience with pricing, setup cost, and licensing?
I would advise taking a close look at logging costs, man-hours needed, and the amount of time it takes for people to get comfortable navigating Datadog because there is so much information that it can be overwhelming to narrow down what you need.
Which other solutions did I evaluate?
We evaluated Dynatrace and looked into New Relic before settling on Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
Easy to use with good speed and helpful dashboards
What is our primary use case?
We are using Datadog to improve our cloud monitoring and observability across our enterprise apps. We have integrated a lot of different resources into Datadog, like Kubernetes, App Gateways, App Service Environments, App Service Plans, and other Web App resources.
I will be using the monitoring and observability features of Datadog. Dashboards are used very heavily by teams and SREs. We really have seen that Datadog has already improved both our monitoring and our observability.
How has it helped my organization?
The ease and speed with which you can create a dashboard have been a huge improvement.
The different types of monitors we can create have been huge, too. We can do so many different things with monitors that we couldn't do before with our alerts.
Being able to click on a trace or log and drill down on it to see what happened has been great.
Some have found the learning curve a bit steep. That said, they are coming around slowly. There is just a lot of information to learn how to navigate.
What is most valuable?
The different types of monitors have been very valuable. We have been able to make our alerts (monitors) more actionable than we were able to previously.
Watchdog is a favorite feature among a lot of the devs. It catches things they didn't even know were an issue.
RUM is another feature a lot of us are looking forward to seeing how it can help us improve our customer experience during tax season.
We hope to enable the code review feature at some point so we can see what code caused the issue.
What needs improvement?
I would like to see the integration between PagerDuty and Datadog improved. The tags in Datadog don't match those in PagerDuty, and we have to make it work. I would also like it to be easier to replicate a KQL query in Datadog.
I would like to see the alert communications to email or phones made better so we could hopefully move off PagerDuty and just use Datadog for that.
There are also a lot of features that we haven't budgeted for yet and I would like for us to be able to use them in the future.
For how long have I used the solution?
I've used the solution for about two years.
Which deployment model are you using for this solution?
Hybrid Cloud
Centralized pipeline with synthetic testing and a customized dashboard
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher education web applications. Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge.
Datadog agents on each web host and native integrations with GitHub, AWS, and Azure get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is all obtainable and tunable. Both legacy .NET Framework and Windows Event Viewer and cutting-edge .NET Core with streaming logs all work. The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
Centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most.
The ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
These features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
I'd like to see the Android and iOS apps expanded to include a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go, yet it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed.
In some cases, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly because of environment variables.
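A minimal sketch of a pre-deployment check for the environment variables that log and trace correlation typically depends on. The names `DD_ENV`, `DD_SERVICE`, `DD_VERSION`, and `DD_LOGS_INJECTION` are assumed here to be the relevant ones for the tracer in use; the exact set can differ by language and tracer version, so treat the list as an assumption to verify against the documentation.

```python
# Assumed set of variables; verify against the tracer documentation in use.
REQUIRED_VARS = ("DD_ENV", "DD_SERVICE", "DD_VERSION", "DD_LOGS_INJECTION")


def missing_correlation_vars(environ: dict) -> list:
    """Return the names of correlation-related variables that are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    # Log injection must be explicitly enabled, not merely present.
    injection = environ.get("DD_LOGS_INJECTION", "")
    if injection and injection.lower() != "true":
        missing.append("DD_LOGS_INJECTION (set but not 'true')")
    return missing


# Hypothetical values for illustration only.
example = {
    "DD_ENV": "staging",
    "DD_SERVICE": "registrar-web",
    "DD_LOGS_INJECTION": "false",
}
print(missing_correlation_vars(example))
```

Passing `dict(os.environ)` instead of the example dictionary would run the same check against the real process environment, for instance inside an IIS application pool startup script.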
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The solution has been very scalable and customizable.
How are customer service and support?
Sales service is always helpful in tuning our committed costs and alerting us when we start spending outside the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility regardless of whether the app runs on Linux, Windows, or containers, and whether it is cloud or on-prem hosted.
How was the initial setup?
Generally simple, but .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
I'd count our ROI as significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
Excited to dig further into the new offerings around LLM and continue to grow our footprint in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Excellent for monitoring, analyzing, and optimizing performance
What is our primary use case?
Our primary use case for Datadog is monitoring, analyzing, and optimizing the performance and health of our applications and infrastructure.
We leverage its logging, metrics, and tracing capabilities to pinpoint issues, track system performance, and improve overall reliability. Datadog’s ability to provide real-time insights and alerting on key metrics helps us quickly address issues, ensuring smooth operations.
It’s integral for visibility across our microservices architecture and cloud environments.
How has it helped my organization?
Datadog has been incredibly valuable to our organization. Its ability to pinpoint warnings and errors in logs and provide detailed context is essential for troubleshooting.
The platform's request tracing feature offers comprehensive insights into user flows, allowing us to quickly identify issues and optimize performance.
Additionally, Datadog's real-time monitoring and alerting capabilities help us proactively manage system health, ensuring operational efficiency across our applications and infrastructure.
What is most valuable?
Being able to filter requests by latency is invaluable, as it provides immediate insight into which endpoints require further analysis and optimization. This feature helps us quickly identify performance bottlenecks and prioritize improvements.
Additionally, the ability to filter requests by user email is extremely useful for tracking down user-specific issues faster. It streamlines the troubleshooting process and enables us to provide more targeted support to individual users, improving overall customer satisfaction.
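The two filters described above can be sketched on plain request records. This is an illustration of the idea only: the `RequestRecord` fields, the sample endpoints, and the email addresses are hypothetical and do not reflect how Datadog stores or queries traces.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical fields for illustration.
    endpoint: str
    duration_ms: float
    user_email: str


def slow_requests(records, min_ms):
    """Requests at or above a latency threshold, slowest first."""
    hits = [r for r in records if r.duration_ms >= min_ms]
    return sorted(hits, key=lambda r: r.duration_ms, reverse=True)


def requests_for_user(records, email):
    """Requests made by one user, matched case-insensitively."""
    return [r for r in records if r.user_email.lower() == email.lower()]


records = [
    RequestRecord("/api/orders", 120.0, "ana@example.com"),
    RequestRecord("/api/report", 2300.0, "ben@example.com"),
    RequestRecord("/api/orders", 950.0, "Ana@example.com"),
]
print([r.endpoint for r in slow_requests(records, 900)])
print(len(requests_for_user(records, "ana@example.com")))
```

Sorting the slow requests by duration puts the endpoints most in need of optimization at the top, which matches the prioritization use described above.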
What needs improvement?
The query performance could be improved, particularly when handling large datasets, as slower response times can hinder efficiency. Additionally, the interface can sometimes feel overwhelming, with so much happening at once, which may discourage users from exploring new features. Simplifying the layout or providing clearer guidance could enhance user experience. Any improvements related to query optimization would be highly beneficial, as it would further streamline workflows and boost productivity.
For how long have I used the solution?
I've used the solution for five years.
Good monitoring capabilities, centralizing of logs, and making data easily searchable
What is our primary use case?
Our primary use of Datadog involves monitoring over 50 microservices deployed across three distinct environments. These services vary widely in their functions and resource requirements.
We rely on Datadog to track usage metrics, gather logs, and provide insight into service performance and health. Its flexibility allows us to efficiently monitor both production and development environments, ensuring quick detection and response to any anomalies.
We also have better insight into metrics like latency and memory usage.
How has it helped my organization?
Datadog has significantly improved our organization’s monitoring capabilities by centralizing all of our logs and making them easily searchable. This has streamlined our troubleshooting process, allowing for quicker root cause analysis.
Additionally, its ease of implementation meant that we could cover all of our services comprehensively, ensuring that logs and metrics were thoroughly captured across our entire ecosystem. This has enhanced our ability to maintain system reliability and performance.
What is most valuable?
The intuitive user interface has been one of the most valuable features for us. Unlike other platforms like Grafana, as an example, where learning how to query either involves a lot of trial and error or memorization almost like learning a new language, Datadog’s UI makes finding logs, metrics, and performance data straightforward and efficient. This ease of use has saved us time and reduced the learning curve for new team members, allowing us to focus more on analysis and troubleshooting rather than on learning the tool itself.
What needs improvement?
While the UI and search functionality are excellent, further improvement could be made in the querying of logs by offering more advanced templates or suggestions based on common use cases. This would help users discover powerful queries they might not think to create themselves.
Additionally, enhancing alerting capabilities with more customizable thresholds or automated recommendations could provide better insights, especially when dealing with complex environments like ours with numerous microservices.
For how long have I used the solution?
I've used the solution for five years.
What do I think about the stability of the solution?
We have never experienced any downtime.
Which solution did I use previously and why did I switch?
We previously used Sumo Logic.
Which deployment model are you using for this solution?
Public Cloud
Excellent APM, RUM and dashboards
What is our primary use case?
We use the solution for APM, anomaly detection, resource metrics, RUM, and synthetics.
We use it to build baseline metrics for our apps before we start focusing on performance improvements. A lot of the time, that means looking at methods that take too long to run and diving into DB queries and parsing.
I’ve used it in multiple configurations in AWS and Azure. I’ve built it both with Terraform and by hand.
I’ve used it predominantly with Ruby and Node and a little bit of Python.
How has it helped my organization?
The solution provides deep insights into our stack. It gives us the ability to measure and monitor before making decisions.
We're using it to make informed decisions about performance. Being able to show how across a timeline we increased performance from a release via a visual indication of p50+ metrics is almost magical.
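A minimal sketch of a before-and-after comparison of latency percentiles around a release, using only the Python standard library. The sample data and the choice of p50, p95, and p99 are assumptions for illustration; Datadog computes and graphs its percentiles itself, and this only shows the arithmetic behind such a comparison.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples."""
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def compare_release(before_ms, after_ms):
    """Percent change per percentile; negative means the release got faster."""
    before = latency_percentiles(before_ms)
    after = latency_percentiles(after_ms)
    return {k: round((after[k] - before[k]) / before[k] * 100, 1) for k in before}


# Hypothetical samples: the "after" set is uniformly 20% faster.
before = list(range(100, 200))
after = [v * 0.8 for v in before]
print(compare_release(before, after))
```

Comparing several percentiles rather than an average shows whether a release helped typical requests, the slow tail, or both.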
Another way we use it is for leading indicators of issues that might be happening. For example, anomaly detection on gauge metrics across the app and synthetics with built-in alerting configurations are both ways we can get alerted, sometimes even before a big issue happens.
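To illustrate the general idea of anomaly detection on a gauge metric, here is a deliberately simplified sketch that flags points outside a band around the recent rolling mean. This is a stand-in technique, not Datadog's anomaly detection algorithm; the window size, tolerance, and sample values are assumptions.

```python
from statistics import mean, stdev


def anomalies(values, window=5, tolerance=3.0):
    """Indexes whose value falls outside mean +/- tolerance * stdev of the prior window."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center, spread = mean(history), stdev(history)
        # Guard against a flat history, where stdev is zero.
        band = max(spread * tolerance, 1e-9)
        if abs(values[i] - center) > band:
            flagged.append(i)
    return flagged


# Hypothetical gauge readings with one spike at index 7.
gauge = [50, 52, 51, 49, 50, 51, 50, 95, 51, 50]
print(anomalies(gauge))  # prints [7]
```

Note that the spike widens the band for the points that follow it, so a simple rolling band like this can miss a second anomaly right after the first.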
What is most valuable?
The most valuable aspects include APM, RUM and dashboards.
I think of Datadog as an analytics company first, and I see the integrations around notifications and alerts as a part of insight discoverability.
Everything Datadog offers for me is around knowledge building and how much do I know about the deep details of my stack.
The pricing model makes more sense than what we paid for against other competitors. I was at one job where we used two competing services because DD didn’t have BAA for APM. And then when it offered it, we immediately dumped the other solution for Datadog.
What needs improvement?
Logging is not a great experience. Searching for specific logs and then navigating around the context of the results is slow and cumbersome. Honestly that is my only gripe for Datadog. It’s a wonderful product outside of log searching. I have had better experience using other services that aggregate logs for search.
My use case for it is around discoverability. Log search is fine if I’m just looking for something specific. That said, if it’s something less targeted and I am wandering around looking for possible issues, it’s really unintuitive.
For how long have I used the solution?
I've used the solution for more than eight years.
What about the implementation team?
We always implement the solution in-house.
Which deployment model are you using for this solution?
Private Cloud
Capable of pinpointing warnings and errors in logs and providing detailed context
What is our primary use case?
Our primary use case for Datadog is to monitor, analyze, and optimize the performance and health of our applications and infrastructure.
We leverage its logging, metrics, and tracing capabilities to pinpoint issues, track system performance, and improve overall reliability.
Datadog’s ability to provide real-time insights and alerting on key metrics helps us quickly address issues, ensuring smooth operations. It’s integral for visibility across our microservices architecture and cloud environments.
How has it helped my organization?
Datadog has been incredibly valuable to our organization. Its ability to pinpoint warnings and errors in logs and provide detailed context is essential for troubleshooting.
The platform's request tracing feature offers comprehensive insights into user flows, allowing us to quickly identify issues and optimize performance.
Additionally, Datadog's real-time monitoring and alerting capabilities help us proactively manage system health, ensuring operational efficiency across our applications and infrastructure.
What is most valuable?
Being able to filter requests by latency is invaluable, as it provides immediate insight into which endpoints require further analysis and optimization. This feature helps us quickly identify performance bottlenecks and prioritize improvements.
Additionally, the ability to filter requests by user email is extremely useful for tracking down user-specific issues faster. It streamlines the troubleshooting process and enables us to provide more targeted support to individual users, improving overall customer satisfaction.
What needs improvement?
The query performance could be improved, particularly when handling large datasets, as slower response times can hinder efficiency.
Additionally, the interface can sometimes feel overwhelming, with so much happening at once, which may discourage users from exploring new features.
Simplifying the layout or providing clearer guidance could enhance user experience. Any improvements related to query optimization would be highly beneficial, as it would further streamline workflows and boost productivity.
For how long have I used the solution?
I've used the solution for five years.
Good logging, easy to find issues, and saves time
What is our primary use case?
We use the solution for APM, AWS, Lambda, logging, and infrastructure. We have many different things all over AWS, and having one place to look is great.
We have all sorts of different AWS things out there that are in C# and Node. Having a single place to log and APM into is very important to us.
Keeping track of the cloud infrastructure is also important. We have Lambda, containers, EC2, etc.
Having a super simple interface to filter the searching for APM and logging is great. It is super easy to show people how to use. This is super important to us.
How has it helped my organization?
Finding issues quickly is super important, as is being able to create dashboards and alert on issues.
Having the ability to create dashboards has really taught us how to utilize the searching part of the system. We are able to share them, and build upon them so easily. Many iterations later people are putting some solid information out there.
Alerting is also important to us. We have set up many alerts that help us spot issues in the platform before they become bigger issues. This has enabled my teams to use incidents and address the issues so they are no longer problems.
What is most valuable?
Alerting on running systems is very helpful. Finding issues is quick. We have one place for logging and searching, and we are able to save searches, reference them in the future, and build upon them.
The logging in general is one of my favorite features. The search is so straightforward and easy to use. Just being able to click on a field and add it to the search has taught me so much about the interface. It might not be as useful without a shortcut like that to teach me the system. We have Cloudflare logs in there, and sometimes I have no idea how to filter on such a buried piece of JSON. That is where the interface helps me: by clicking "add to search", I get what I need.
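A minimal sketch of what the click-to-search shortcut effectively produces: a nested JSON path turned into an attribute search term of the form `@path.to.field:value`. The log structure and field names below are hypothetical and are not real Cloudflare log fields; lists are not handled specially.

```python
def facet_terms(event, prefix=""):
    """Flatten nested JSON into attribute search terms such as @a.b.c:value."""
    terms = []
    for key, value in event.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            terms.extend(facet_terms(value, path))
        else:
            text = str(value)
            # Quote values containing spaces so they stay one term.
            if " " in text:
                text = f'"{text}"'
            terms.append(f"@{path}:{text}")
    return terms


# Hypothetical log event for illustration.
log = {
    "http": {"status_code": 403, "request": {"client": {"country": "DE"}}},
    "edge": {"cache_status": "miss"},
}
for term in facet_terms(log):
    print(term)
```

Seeing the generated term for a deeply nested field is a quick way to learn the path syntax without memorizing the log schema.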
What needs improvement?
The PagerDuty replacement is something we are very interested in. We only really use PagerDuty to call the team when things are down.
I would love to have a Datadog guru come in and do a department training directly on our setup. We would love to have someone show us the things we could do better within our current setup.
Saving a bit of cash would also help, if there are things we are doing that are costing us. It's a big enough tool that it is tough to have someone dedicated to managing it.
For how long have I used the solution?
I've used the solution for a bit over a year at this point.
What do I think about the stability of the solution?
The stability seems good here too.
What do I think about the scalability of the solution?
Scalability seems good to me. I have no complaints.
How are customer service and support?
I get answers from our contact, and one team member did reach out. It went well.
Which solution did I use previously and why did I switch?
We used Loggly.
We switched because we wanted an all-in-one tool.
How was the initial setup?
Some parts of our setup were tough. Some Windows container setups cost us a lot of time.
The AWS infrastructure was tough to fully turn on due to the large cost of everything being run.
What about the implementation team?
We handled the setup ourselves in-house.
What was our ROI?
This cost us more overall. ROI is hard to sell. That said, I can find issues way faster and see what is going on in my entire platform. I pay back the cost every month with productivity.
What's my experience with pricing, setup cost, and licensing?
It is going to cost you more than you think to keep everything running. We saw value in the one-for-all solution, however, it came at a premium to what we were paying.
Which other solutions did I evaluate?
We did evaluate Dynatrace.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Customizable alerts, good dashboards, and improves reliability
What is our primary use case?
We have several teams and several different projects, all working in tandem, so there are a lot of logs and monitoring that need to be done. We use Datadog mostly for alerting when things go down.
We also have several dashboards to keep track of critical operations and to make sure things are running without issues. The Slack messaging is essential in our workflow in letting us know when an alert is triggered. I also appreciate all the graphs you can make, as it gives our team a good overview of how our services are doing.
How has it helped my organization?
It has improved our reliability and our time to get back up from an outage. By creating an alert and then messaging a Slack channel, we know when something goes down fairly fast. This, in turn, improves our response time to swarm on an issue without it affecting customers. The graphs have also been useful to demonstrate to higher-ups how our services are performing, allowing them to make more informed decisions when it comes to the team.
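A minimal sketch of the alert-to-Slack step, building the JSON body that a Slack incoming webhook accepts. This is a generic illustration and not how Datadog's own Slack integration is implemented; the monitor name, service, and link are hypothetical, and nothing is sent over the network.

```python
import json


def slack_alert_payload(monitor_name, status, service, link):
    """Build the JSON body for a Slack incoming-webhook message."""
    icons = {
        "alert": ":red_circle:",
        "warn": ":large_yellow_circle:",
        "ok": ":large_green_circle:",
    }
    icon = icons.get(status, ":grey_question:")
    # Slack mrkdwn: *bold*, `code`, and <url|label> for links.
    text = (
        f"{icon} *{monitor_name}* is {status.upper()} for `{service}`\n"
        f"<{link}|Open monitor>"
    )
    return {"text": text}


# Hypothetical values for illustration only.
payload = slack_alert_payload(
    "API latency", "alert", "checkout", "https://example.com/monitors/123"
)
print(json.dumps(payload))
```

Posting this body to a webhook URL for the team's channel would deliver the message; keeping the status and service in the first line makes the alert readable from the notification preview.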
What is most valuable?
The alerts are the most valuable. Having alerts has saved us countless times in the past and is essentially what we use Datadog for.
I like how we can customize alerts, and when alerts have become too noisy, we turn their threshold down fairly easily. This is also the case when alerts should be notifying us more often.
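One way to reason about a noisy alert before changing it is to replay recent values against candidate settings and count how often each would have fired. The sketch below does that for a threshold and a "sustained for N points" condition; the CPU values and candidate thresholds are hypothetical, and this is not a Datadog feature.

```python
def alert_count(values, threshold, sustain=1):
    """Count alert triggers: runs of at least `sustain` consecutive points above threshold."""
    count, run = 0, 0
    for v in values:
        run = run + 1 if v > threshold else 0
        # Fire once, at the moment the run reaches the required length.
        if run == sustain:
            count += 1
    return count


# Hypothetical CPU readings (percent).
cpu = [72, 91, 74, 93, 95, 96, 70, 92, 71, 97, 98, 99]
for threshold in (90, 95):
    print(threshold, alert_count(cpu, threshold), alert_count(cpu, threshold, sustain=3))
```

On this sample, raising the threshold from 90 to 95 halves the number of triggers, and requiring three consecutive points halves it again, which shows how either knob reduces noise.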
I also like the graphs and how customizable they are. It allows us to create a nice-looking dashboard with all sorts of information relating to our project. This gives us a quick overview of how things are going.
What needs improvement?
It's not that straightforward when creating an alert. The syntax is a little confusing. I guess that the trade-off is customizability. But it would be nice to have a click-and-drag kind of way when creating an alert. So, if someone who isn't so familiar with Datadog or tech in general wanted to create an alert, they wouldn't need to know the syntax.
It would also be great if AI could be used to generate alerts and graphs. I could write a short prompt, and then the AI could auto-generate alerts and graphs for me.
For how long have I used the solution?
I've used the solution for more than two years.