Capable of pinpointing warnings and errors in logs and providing detailed context
What is our primary use case?
Our primary use case for Datadog is to monitor, analyze, and optimize the performance and health of our applications and infrastructure.
We leverage its logging, metrics, and tracing capabilities to pinpoint issues, track system performance, and improve overall reliability.
Datadog’s ability to provide real-time insights and alerting on key metrics helps us quickly address issues, ensuring smooth operations. It’s integral for visibility across our microservices architecture and cloud environments.
How has it helped my organization?
Datadog has been incredibly valuable to our organization. Its ability to pinpoint warnings and errors in logs and provide detailed context is essential for troubleshooting.
The platform's request tracing feature offers comprehensive insights into user flows, allowing us to quickly identify issues and optimize performance.
Additionally, Datadog's real-time monitoring and alerting capabilities help us proactively manage system health, ensuring operational efficiency across our applications and infrastructure.
What is most valuable?
Being able to filter requests by latency is invaluable, as it provides immediate insight into which endpoints require further analysis and optimization. This feature helps us quickly identify performance bottlenecks and prioritize improvements.
Additionally, the ability to filter requests by user email is extremely useful for tracking down user-specific issues faster. It streamlines the troubleshooting process and enables us to provide more targeted support to individual users, improving overall customer satisfaction.
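For illustration, the queries behind those two filters look something like the following in the trace explorer (the facet names here are common defaults and depend on how services are instrumented):

    @duration:>2s service:checkout env:prod
    @usr.email:"jane@example.com" status:error

Filtering by email only works if the attribute is attached to traces in the first place. With the Python tracer, for instance, a minimal sketch would be (the usr.email key is the conventional choice; "user" is a hypothetical request object):

    from ddtrace import tracer

    span = tracer.current_root_span()
    if span:
        # 'user' is a hypothetical object holding the authenticated user
        span.set_tag("usr.email", user.email)  # makes @usr.email searchable for this request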
What needs improvement?
The query performance could be improved, particularly when handling large datasets, as slower response times can hinder efficiency.
Additionally, the interface can sometimes feel overwhelming, with so much happening at once, which may discourage users from exploring new features.
Simplifying the layout or providing clearer guidance could enhance user experience. Any improvements related to query optimization would be highly beneficial, as it would further streamline workflows and boost productivity.
For how long have I used the solution?
I've used the solution for five years.
Good logging, easy to find issues, and saves time
What is our primary use case?
We use the solution for APM, AWS, Lambda, logging, and infrastructure. We have many different things all over AWS, and having one place to look is great.
We have all sorts of different AWS things out there that are in C# and Node. Having a single place to log and APM into is very important to us.
Keeping track of the cloud infrastructure is also important. We have Lambda, containers, EC2, etc.
Having a super simple interface to filter the searching for APM and logging is great. It is super easy to show people how to use. This is super important to us.
How has it helped my organization?
Finding issues quickly is super important, as is being able to create dashboards and alert on issues.
Having the ability to create dashboards has really taught us how to utilize the searching part of the system. We are able to share them, and build upon them so easily. Many iterations later people are putting some solid information out there.
Alerting is also important to us. We have set up many alerts that help us spot issues in the platform before they become bigger issues. This has enabled my teams to use incidents and address the issues so they are no longer problems.
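For context, these threshold monitors boil down to one line of Datadog's monitor query syntax: an aggregation over a time window, a metric, a scope, and a threshold. A hedged sketch against the AWS Lambda integration metrics (tag values and threshold are placeholders, not our real setup):

    avg(last_5m):sum:aws.lambda.errors{env:prod} by {functionname} > 10

This fires separately for each Lambda function whose error count over the trailing five minutes crosses the threshold.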
What is most valuable?
Alerting on running systems is very helpful, and finding issues is quick. We have one place for logging and searching through it all, and we can save those searches, reference them in the future, and build upon them.
The logging in general is one of my favorite features. The search is so straightforward and easy to use. Just being able to click on a field and add it to the search has taught me so much about the interface; it might not be as useful without a shortcut like that to teach me the system. We have Cloudflare logs in there, and I sometimes have no idea how to filter on such a buried piece of JSON. That is where the interface helps me: by clicking add-to-search, I get what I need.
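For a flavor of what those clicks generate, the search bar ends up holding facet queries roughly like this (the attribute paths are illustrative; the real Cloudflare fields may be nested differently in your pipeline):

    source:cloudflare @http.status_code:>=500
    source:cloudflare @network.client.ip:203.0.113.*

Clicking a JSON field in a log builds the @path.to.field:value filter for you, which is exactly the shortcut I lean on.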
What needs improvement?
The "Pager Duty" replacement is something we are very interested in. We only really use pager duty to call the team when things are down.
I love to have some DD guru come in and do a department training directly at our setup. We would love to have someone come in and show us the things we could do better within our current setup.
Also saving a bit of cash would also help if there are things we are doing that are costing us. It's a big enough tool that it is tough to have someone dedicated to manage.
For how long have I used the solution?
I've used the solution for a bit over a year at this point.
What do I think about the stability of the solution?
The stability seems good here too.
What do I think about the scalability of the solution?
Scalability seems good to me. I have no complaints.
How are customer service and support?
I get answers from our contact, and one team member did reach out. It went well.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We used Loggly.
We switched because we wanted an all-in-one tool.
How was the initial setup?
Some parts of our setup were tough. Some Windows container setups cost us a lot of time.
The AWS infrastructure was tough to fully turn on due to the large cost of everything being run.
What about the implementation team?
We handled the setup ourselves in-house.
What was our ROI?
This cost us more overall, so ROI is hard to sell. That said, I can find issues far faster and see what is going on across my entire platform. I pay back the cost every month in productivity.
What's my experience with pricing, setup cost, and licensing?
It is going to cost you more than you think to keep everything running. We saw value in the one-for-all solution, however, it came at a premium to what we were paying.
Which other solutions did I evaluate?
We did evaluate Dynatrace.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Improved time to discovery and resolution but needs better consumption visibility
What is our primary use case?
The product monitors multiple systems, from customer interactions on our web applications down to the database and all layers in between. RUM, APM, logging, and infrastructure monitoring are all surfaced into single dashboards.
We initially started with application logs and generated long-term business metrics out of critical logs. We have turned those metrics and logs into a collection of alerts integrated into our pager system. As we have evolved, we have also used APM and RUM data to trigger additional alerts.
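As a sketch of that pattern (query and handle names are placeholders, not our real ones), a log-based monitor wired into paging looks roughly like:

    logs("service:payments status:error").index("*").rollup("count").last("10m") > 100

with a notification handle such as @pagerduty-payments-oncall in the monitor message to route the page.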
How has it helped my organization?
The solution has surfaced how integrated our applications really are and helps us track calls from the top down, identifying slowness and errors all through the call stack.
The biggest improvement we have seen is our time to discovery and resolution. As Datadog has improved, and we add new features, the depth and clarity we get from top to bottom has been excellent. Our engineering teams have quickly adopted many features within Datadog, and are quick to build out their own dashboards and alerts. This has also led to a rapid sprawl when left unchecked.
What is most valuable?
We started with application logs and have expanded over the years to include infrastructure, APM, and now RUM. All of these tools have been incredibly valuable in their own sphere. The huge value is tying all of the data points together.
Logging was the first tool we started with years ago, replacing our ELK stack. It was the easiest to get in place, and our engineers quickly embraced the tools. Several critical dashboards were created years ago and are still in use today. Over time, we have shifted from verbose logs and matured into APM and RUM. That has helped us focus on fine-tuning the performance of our applications.
What needs improvement?
We need better visibility into our consumption rate, which is tied to our commit levels. We would love to see a percentage-consumed figure and to be alerted if we are over budget before getting an overage charge 20 days into the month.
The biggest complaint we hear comes from the cost of the tool. It is pretty easy to accidentally consume a lot of extra data. Unless you watch everything come in almost daily, you could be in for a big surprise.
We utilize the Datadog estimated usage metrics to build out alerts and dashboards. The usage and cost page still doesn't tie into our committed spending; it would be wonderful to see the monthly burn rate on any given day.
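For anyone building the same guardrails: the metrics in question live under the datadog.estimated_usage.* namespace (for example, datadog.estimated_usage.logs.ingested_bytes), so a rough daily-burn monitor can be sketched as (the threshold is a placeholder, and the right rollup depends on the metric):

    sum(last_1d):sum:datadog.estimated_usage.logs.ingested_bytes{*} > 500000000000

It is a workaround, though; it still knows nothing about the committed spend.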
For how long have I used the solution?
I've used the solution for six years.
What do I think about the stability of the solution?
There have not been as many outages in the past year. We also haven't been jumping into new features as quickly as they come out, so we may simply be working with more stable, mature products.
What do I think about the scalability of the solution?
It has scaled up to meet our needs pretty well. Over the years, we have only managed to trigger internal Datadog alerts once or twice by misconfiguring a metric and spiraling out of control with costs.
How are customer service and support?
Support has been lacking. Opening a chat with the tech support rep of the day is always a gamble. We are looking into working with third-party support because it has been so rough over the years.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We used the ELK stack for logging and monitoring and AppDynamics for APM.
How was the initial setup?
The initial setup for new teams has become easier over the years. We are increasing our adoption rate as we shift our technology to more cloud-native tools. Datadog has supported easy implementation by simply adding a package to the app.
They have really focused on a lot of out-of-the-box functionality, but the real fun happens as you dive deeper into the configuration. We have also begun adopting OpenTelemetry standards, which has kept us from going too deep into vendor-specific implementations.
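As an illustration of that vendor-neutral layer, instrumentation written against the standard OpenTelemetry API stays portable; a minimal Python sketch (span and attribute names are illustrative):

    from opentelemetry import trace

    tracer = trace.get_tracer("payments")  # instrumentation scope name

    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("order.id", "o-1234")  # example attribute
        # business logic goes here; spans export through whatever OTLP
        # backend is configured, the Datadog Agent included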
What about the implementation team?
We did the initial setup via an in-house team.
What was our ROI?
As long as we stay on top of our consumption mid-month, it has been worth it. However, the few engineers we have who are dedicated to playing whack-a-mole with the growing spending could be better utilized in teaching best practices to new users. I suppose our implementation of the rapidly changing tools over the years has led to a fair amount of technical debt.
What's my experience with pricing, setup cost, and licensing?
It is quite easy to set up any specific tool, but to take advantage of the full visibility it offers, you need to instrument across the board—which can be time-consuming. Be careful about how each tool is billed, and watch your consumption like a hawk.
Which other solutions did I evaluate?
What other advice do I have?
It's a very powerful tool, with lots of new features coming, but you certainly will pay for what you get.
Which deployment model are you using for this solution?
Public Cloud
Very good custom metrics, dashboards, and alerts
What is our primary use case?
Our primary use case for Datadog involves utilizing its dashboards, monitors, and alerts to monitor several key components of our infrastructure.
We track the performance of AWS-managed Airflow pipelines, focusing on metrics like data freshness, data volume, pipeline success rates, and overall performance.
In addition, we monitor Looker dashboard performance to ensure data is processed efficiently. Database performance is also closely tracked, allowing us to address any potential issues proactively. This setup provides comprehensive observability and ensures that our systems operate smoothly.
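For context on how pipeline metrics like freshness and volume typically reach Datadog from Airflow, here is a hedged sketch using the DogStatsD client from the datadog Python package (the metric names and tags are our illustrative choices, not built-ins):

    from datadog import statsd

    # emitted at the end of an Airflow task
    statsd.increment("pipeline.run.success", tags=["dag:daily_ingest"])
    statsd.gauge("pipeline.rows_loaded", 125000, tags=["dag:daily_ingest"])
    statsd.gauge("pipeline.data_age_seconds", 3600, tags=["dag:daily_ingest"])

Monitors and dashboards are then built on top of these custom metrics.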
How has it helped my organization?
Datadog has significantly improved our organization by providing a centralized platform to monitor all our key metrics across various systems. This unified observability has streamlined our ability to oversee infrastructure, applications, and databases from a single location.
Furthermore, the ability to set custom alerts has been invaluable, allowing us to receive real-time notifications when any system degradation occurs. This proactive monitoring has enhanced our ability to respond swiftly to issues, reducing downtime and improving overall system reliability. As a result, Datadog has contributed to increased operational efficiency and minimized potential risks to our services.
What is most valuable?
The most valuable features we’ve found in Datadog are its custom metrics, dashboards, and alerts. The ability to create custom metrics allows us to track specific performance indicators that are critical to our operations, giving us greater control and insights into system behavior.
The dashboards provide a comprehensive and visually intuitive way to monitor all our key data points in real-time, making it easier to spot trends and potential issues. Additionally, the alerting system ensures we are promptly notified of any system anomalies or degradations, enabling us to take immediate action to prevent downtime.
Beyond the product features, Datadog’s customer support has been incredibly timely and helpful, resolving any issues quickly and ensuring minimal disruption to our workflow. This combination of features and support has made Datadog an essential tool in our environment.
What needs improvement?
One key improvement we would like to see in a future Datadog release is the inclusion of certain metrics that are currently unavailable. Specifically, the ability to monitor CPU and memory utilization of AWS-managed Airflow workers, schedulers, and web servers would be highly beneficial for our organization. These metrics are critical for understanding the performance and resource usage of our Airflow infrastructure, and having them directly in Datadog would provide a more comprehensive view of our system’s health. This would enable us to diagnose issues faster, optimize resource allocation, and improve overall system performance. Including these metrics in Datadog would greatly enhance its utility for teams working with AWS-managed Airflow.
For how long have I used the solution?
I've used the solution for four months.
What do I think about the stability of the solution?
The stability of Datadog has been excellent. We have not encountered any significant issues so far.
The platform performs reliably, and we have experienced minimal disruptions or downtime. This stability has been crucial for maintaining consistent monitoring and ensuring that our observability needs are met without interruption.
What do I think about the scalability of the solution?
Datadog is generally scalable, allowing us to handle and display thousands of custom metrics efficiently. However, we’ve encountered some limitations in the table visualization view, particularly when working with around 10,000 data points. In those cases, the search functionality doesn’t always return all valid results, which can hinder detailed analysis.
How are customer service and support?
Datadog's customer support plays a crucial role in easing the initial setup process. Their team is proactive in assisting with metric configuration, providing valuable examples, and helping us navigate the setup challenges effectively. This support significantly mitigates the complexity of the initial setup.
Which solution did I use previously and why did I switch?
We used New Relic before.
How was the initial setup?
The initial setup of Datadog can be somewhat complex, primarily due to the learning curve associated with configuring each metric field correctly for optimal data visualization. It often requires careful attention to detail and a good understanding of each option to achieve the desired graphs and insights.
What about the implementation team?
We implemented the solution in-house.
Good centralized pipeline tracking and error logging with very good performance
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher education web applications.
Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge.
Datadog agents on each web host and native integrations with GitHub, AWS, and Azure get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Using Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is all obtainable and tunable. Both legacy .NET Framework and Windows Event Viewer and cutting-edge .NET Core with streaming logs all work.
The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
When it comes to Datadog, several features have proven particularly valuable. For example, the centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users.
Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
Together, these features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
They need an expansion of the Android and iOS apps to provide a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go. That said, it seems the options are still a bit limited out of the box.
While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed.
In some cases, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
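For anyone hitting the same wall, the settings involved were, as best I can reconstruct it, the standard unified-service-tagging and log-injection variables (the values are examples; for IIS they go on the app pool or machine environment):

    DD_ENV=prod
    DD_SERVICE=campus-portal
    DD_VERSION=1.4.2
    DD_LOGS_INJECTION=true

Once those agree between the tracer and the log pipeline, the trace-to-logs links in the UI start resolving.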
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The solution has been very scalable and very customizable.
How are customer service and support?
Support is always helpful in tuning our committed costs and alerting us when we start spending out of the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility regardless of Linux, Windows, or containers, cloud or on-prem hosted.
How was the initial setup?
The implementation is generally simple. That said, .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
The solution was implemented in-house.
What was our ROI?
Our ROI has been significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can impact cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We are excited to explore the new offerings around LLM further and continue to expand our presence in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Consolidates alerts, offers comprehensive views, and has synthetic testing
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher education web applications.
We're managing a hybrid multi-cloud solution across hundreds of applications, which is always a challenge. There are Datadog agents on each web host, and native integrations with GitHub, AWS, and Azure and that gets all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards. Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is all obtainable and tunable. Both legacy .NET Framework and Windows Event Viewer and cutting-edge .NET Core with streaming logs all work. The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
When it comes to Datadog, several features have proven particularly valuable.
The centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
Together, these features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go, however, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed.
Sometimes, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The product is very scalable and very customizable.
How are customer service and support?
Technical support is always helpful in tuning our committed costs and alerting us when we start spending out of the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility regardless of Linux, Windows, or containers, cloud or on-prem hosted.
How was the initial setup?
The setup is generally simple. .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
ROI is reflected in significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can impact cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We're excited to explore the new offerings around LLM further and continue to expand our presence in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Good synthetic testing, centralized pipeline tracking and error logging
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher education web applications.
Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge. Datadog agents on each web host and native integrations with GitHub, AWS, and Azure get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards. Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is all obtainable and tunable. Both legacy .NET Framework and Windows Event Viewer and cutting-edge .NET Core with streaming logs all work. The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
When it comes to Datadog, several features have proven particularly valuable.
The centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
Together, these features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go; however, it seems the options are still a bit limited out of the box.
While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases, the screenshots don't match the text as updates are made. I feel I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The solution is very scalable and very customizable.
How are customer service and support?
Sales service is always helpful in tuning our committed costs and alerting us when we start spending outside the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility regardless of Linux, Windows, or containers, cloud or on-prem hosted.
How was the initial setup?
The setup is generally simple. That said, .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
The solution was implemented in-house.
What was our ROI?
I'd count our ROI as significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
It's a good idea to set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We are excited to dig further into the new offerings around LLM and continue to grow our footprint in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Easy dashboard creation and alarm monitoring with a good ROI
What is our primary use case?
We use the solution to monitor production service uptime/downtime, latency, and log storage.
Our entire monitoring infrastructure runs off Datadog, so all our alarms are configured with it. We also use it for tracing API performance and identifying the biggest regression points.
Finally, we use it to compare performance on SEO metrics versus competitors. This is a primary use case, as SEO dictates our position in Google search traffic, which drives a large portion of our customer views, so it is a vital part of the business that we rely on Datadog for.
How has it helped my organization?
The product improved the organization primarily by providing consistent data with virtually zero downtime, which was a problem we had with an old provider. It also eased an otherwise massive migration involving hundreds of alarms.
The training provided was crucial, along with having a dedicated team that can forward our requests to and from Datadog efficiently. Without that, we may have never transitioned to Datadog in the first place since it is always hard to lead a migration for an entire company.
What is most valuable?
The API tracing has been massive for debugging latency regressions and improving the performance of our least performant APIs. Through tracing, we managed to find the slowest step of an API, improve its latency, and iterate on the process until we had our desired timings. This is important for improving our SEO, as LCP and INP are taken directly from the numbers we see in Datadog for our API timings.
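Where automatic instrumentation doesn't split an endpoint finely enough, custom spans are the usual next step; here is a minimal sketch with the Python tracer (span and function names are illustrative, and fetch_photos/build_page are hypothetical helpers):

    from ddtrace import tracer

    @tracer.wrap("listing.render")  # appears as its own span in the flame graph
    def render_listing(listing_id):
        with tracer.trace("listing.fetch_photos"):  # time one suspect step on its own
            photos = fetch_photos(listing_id)  # hypothetical helper
        return build_page(listing_id, photos)  # hypothetical helper

Each iteration then reads straight off the flame graph: fix the widest span, redeploy, and re-measure.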
The ease of dashboard creation and alarm monitoring has helped us not only stay competitive but be industry leaders in performance.
What needs improvement?
The product could be improved by allowing variables in API grouping so that any APIs differing only by a unique ID could be grouped together.
Furthermore, SEO monitoring has been crucial for us but also difficult to set up, as comparing alarms between us and our competitors is a tough feat. The data is not always consistent, so we have been experimenting with removing the noise in Datadog, but it's been taking a while.
Finally, Datadog should have a feature that reports stale alarms based on activity.
For how long have I used the solution?
I've used the solution for six months.
What do I think about the stability of the solution?
It's very stable, and we have not experienced an issue with downtime on Datadog.
What do I think about the scalability of the solution?
Datadog scales well; growing volume has not seemed to slow it down.
How are customer service and support?
We haven't talked to the support team.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We switched to Datadog because our previous provider had very inconsistent logging. Our alarms would often fail to fire when our services were down because the provider itself had a logging problem.
How was the initial setup?
The initial setup was somewhat complex due to the built-in monitoring for services. It is not always comprehensive and has to be studied, unlike other metrics platforms that simply expose all your endpoints so that you can trace them with something like Grafana.
What about the implementation team?
We implemented the solution through an in-house team.
What was our ROI?
What's my experience with pricing, setup cost, and licensing?
Users should try to understand how Datadog alarms work off the bat so that they can minimize the need for expensive features like custom metrics.
It can sometimes be tempting to use them; however, it is not always necessary as you migrate to Datadog, as it is a provider that treats alarms somewhat differently than you may be used to.
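One concrete rule of thumb behind that advice: custom metrics are billed by unique combinations of metric name and tag values, so a single high-cardinality tag can quietly multiply the bill. A sketch of the trap (metric and tag names are illustrative):

    from datadog import statsd

    # A distinct billable series for every user id that ever appears:
    statsd.increment("checkout.completed", tags=["user_id:8675309"])  # expensive

    # Bounded tag values keep the series count, and the cost, predictable:
    statsd.increment("checkout.completed", tags=["plan:premium"])  # cheap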
Which other solutions did I evaluate?
We have evaluated New Relic, Grafana, Splunk, and many more in our quest to find the best monitoring provider.
Which deployment model are you using for this solution?
Hybrid Cloud
Customizable alerts, good dashboards, and improves reliability
What is our primary use case?
We have several teams and several different projects, all working in tandem, so there are a lot of logs and monitoring that need to be done. We use Datadog mostly for alerting when things go down.
We also have several dashboards to keep track of critical operations and to make sure things are running without issues. The Slack messaging is essential in our workflow in letting us know when an alert is triggered. I also appreciate all the graphs you can make, as it gives our team a good overview of how our services are doing.
How has it helped my organization?
It has improved our reliability and our time to get back up from an outage. By creating an alert and then messaging a Slack channel, we know when something goes down fairly fast. This, in turn, improves our response time to swarm on an issue without it affecting customers. The graphs have also been useful to demonstrate to higher-ups how our services are performing, allowing them to make more informed decisions when it comes to the team.
What is most valuable?
The alerts are the most valuable feature. Having alerts has saved us countless times in the past and is essentially what we use Datadog for.
I like how we can customize alerts, and when alerts have become too noisy, we turn their threshold down fairly easily. This is also the case when alerts should be notifying us more often.
I also like the graphs and how customizable they are. It allows us to create a nice-looking dashboard with all sorts of information relating to our project. This gives us a quick overview of how things are going.
What needs improvement?
It's not that straightforward to create an alert; the syntax is a little confusing. I guess the trade-off is customizability, but it would be nice to have a click-and-drag way of creating alerts so that someone who isn't so familiar with Datadog, or with tech in general, could create one without needing to know the syntax.
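For anyone who hasn't seen it, the syntax being described is roughly: an aggregation over a time window, then a metric, scope, grouping, and threshold, plus a message that mixes template variables with notification handles (every name here is a placeholder):

    avg(last_10m):avg:trace.http.request.duration{env:prod} by {resource_name} > 0.5

    {{#is_alert}}Latency breached on {{resource_name.name}}{{/is_alert}} @slack-team-alerts

It's powerful once learned, but not something a non-engineer would guess.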
It would also be great if AI could be used to generate alerts and graphs. I could write a short prompt, and then the AI could auto-generate alerts and graphs for me.
For how long have I used the solution?
I've used the solution for more than two years.
A great tool with an easy setup and helpful error logs
What is our primary use case?
We currently have an error monitor to monitor errors on our prod environment. Once we hit a certain threshold, we get an alert on Slack. This helps address issues the moment they happen before our users notice.
We also utilize synthetic tests on many pages of our site. They're easy to set up and are great for pinpointing when a shipped bug takes down a less-visited page that we wouldn't otherwise be immediately aware of. It's a great extra check to make sure the code we ship is free of bugs.
How has it helped my organization?
The synthetic tests have been invaluable. We use them to check various pages and ensure functionality across multiple areas. Furthermore, our error monitoring alerts have been crucial in letting us know of problems the moment they pop up.
Datadog has been a great tool, and all of our teams utilize many of its features. We have regular mob sessions where we look at our Datadog error logs and see what we can address as a team. It's been great at providing more insight into our users and logging errors that can be fixed.
What is most valuable?
The error logs have been super helpful in breaking down issues affecting our users. Our monitors let us know once we hit a certain threshold as well, which is good for momentary blips and issues with third-party providers or rollouts that we have in the works. Just last week, we had a roll-out where various features were broken due to a change in our backend API. Our Datadog logs instantly notified us of the issues, and we could troubleshoot everything much more easily than just testing blind. This was crucial to a successful rollout.
What needs improvement?
I honestly can't think of anything that can be improved. We've started using more and more features from our Datadog account and are really grateful for all of the different ways we can track and monitor our site.
We did have an issue where a synthetic test was set up before the holiday break, and we were quickly charged a great amount. Our team worked with Datadog, and they were able to help us out since it was inadvertent on our end and was a user error. That was greatly appreciated and something that helped start our relationship with the Datadog team.
For how long have I used the solution?
We've been using Datadog for several months. We started with the synthetic tests and now use it for error handling and in many other ways.
What do I think about the stability of the solution?
Stability has been great. We've had no issues so far.
What do I think about the scalability of the solution?
The solution is very easy to scale. We've used it on multiple clients.
How are customer service and support?
We had a dev who had set up a synthetic test that was running every five minutes in every single region over the holiday break last year. The Datadog team was great and very understanding and we were able to work this out with them.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We didn't have any previous solution. At a previous company, I used Sentry; however, I find Datadog to be much easier, plus the inclusion of synthetic tests is awesome.
How was the initial setup?
The documentation was great and our setup was easy.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
This has had a great ROI as we've been able to address critical bugs that have been found via our Datadog tools.
What's my experience with pricing, setup cost, and licensing?
The setup cost was minimal. The documentation is great and the product is very easy to set up.
Which other solutions did I evaluate?
We also looked at other providers and settled on Datadog. It's been great to use across all our clients.
Which deployment model are you using for this solution?
Private Cloud